Context-aware Recommendation System for Arabic Cultural Topics

Written by Mohammad Naama | Wednesday, 25 March 2020

I.  INTRODUCTION
Recommender systems are tools for interacting with large and complex information spaces. They provide a personalized view of such spaces, prioritizing items likely to be of interest to the user. The field, christened in 1995, has grown enormously in the variety of problems addressed and techniques employed, as well as in its practical applications. Recommender systems research has incorporated a wide variety of artificial intelligence techniques, including machine learning, data mining, user modelling, case-based reasoning, and constraint satisfaction, among others. Personalized recommendations are an important part of many online e-commerce applications, and this wealth of practical experience has inspired researchers to extend the reach of recommender systems into new and challenging areas. This article provides an overview of the current state of the field.

Recommender systems help overcome information overload by providing personalized suggestions based on a history of a user's likes and dislikes. Many online stores provide recommendation services, e.g. Amazon, CDNOW, Barnes & Noble, and IMDb. There are two prevalent approaches to building recommender systems: collaborative filtering (CF) and content-based (CB) recommending. CF systems work by collecting user feedback in the form of ratings for items in a given domain and exploit similarities and differences among the profiles of several users to determine how to recommend an item. Content-based methods, on the other hand, provide recommendations by comparing representations of the content contained in an item to representations of the content that interests the user.
Content-based methods can uniquely characterize each user, but CF still has some key advantages over them [1][2].

Firstly, CF can perform well in domains where there is not much content associated with items, or where the content is difficult for a computer to analyze (ideas, opinions, etc.). Secondly, a CF system can provide serendipitous recommendations, i.e. it can recommend items that are relevant to the user but do not contain content from the user's profile. For these reasons, CF systems have been used successfully to build recommender systems in various domains.

II. The Long Tail
Before discussing the principal applications of recommendation systems, let us ponder the long tail phenomenon that makes recommendation systems necessary. Physical delivery systems are characterized by a scarcity of resources.
Brick-and-mortar stores have limited shelf space, and can show the customer only a small fraction of all the choices that exist. On the other hand, on-line stores can make anything that exists available to the customer. Thus, a physical bookstore may have several thousand books on its shelves, but Amazon offers millions of books. A physical newspaper can print several dozen articles per day, while on-line news services offer thousands per day.
Recommendation in the physical world is simple. First, it is not possible to tailor the store to each individual customer. Thus, the choice of what is made available is governed only by the aggregate numbers. Typically, a bookstore will display only the books that are most popular, and a newspaper will print only the articles it believes the most people will be interested in.
In the first case, sales figures govern the choices, in the second case, editorial judgement serves.
The distinction between the physical and on-line worlds has been called the long tail phenomenon. Picture a plot in which the vertical axis represents popularity (the number of times an item is chosen) and the items are ordered along the horizontal axis by decreasing popularity. Physical institutions provide only the most popular items, to the left of a vertical cutoff, while the corresponding on-line institutions provide the entire range of items: the tail as well as the popular items.
The long-tail phenomenon forces on-line institutions to recommend items to individual users. It is not possible to present all available items to the user, the way physical institutions can. Neither can we expect users to have heard of each of the items they might like.


 

III. RESEARCH MOTIVATION
Humans are quite successful in communicating ideas and reacting appropriately, due to several factors including:

·A person uses a language rich in meanings and expressions to communicate a particular idea, together with a shared understanding of how the words of this language are used to express the situations people encounter in life.
·In traditional modes of interaction, little of the information available to users is exposed to the computer. As a result, computers cannot take full advantage of this information to support dialogue with users. By improving the computer's ability to access this content, we increase the interaction and communication between these devices and their users and can thus create more effective services.
This research aims to solve the problem of presenting a user's preferred Arabic cultural topics by merging two algorithms (collaborative filtering and content-based filtering) into a hybrid recommender system [3].

IV. Arabic Language Characteristics
Arabic is a Semitic language, which differs from Indo-European languages syntactically, morphologically and semantically. The Arabic writing system has twenty-five consonants and three long vowels that are written from right to left and take different shapes according to their position in the word. In addition, Arabic has short vowels, which are not part of the alphabet; they are written as vowel diacritics above or under a consonant to give it its desired sound and hence give a word its desired meaning. Texts without vowels are considered more appropriate by the Arabic-speaking community, since they are widely used in everyday written and printed materials (books, magazines, newspapers, letters, etc.) [4][5]. However, in the Holy Koran, printed collections of classical poetry, school books and some Arabic paper dictionaries, vowel diacritics appear in full.

In well-edited books, some printed texts, and manuscripts, vowel diacritics are usually written only partially or selectively, where words could otherwise be ambiguous or difficult to read.

Arabic is a very rich language, and its morphological representation is rather complex.

The Arabic language is based on a set of roots from which all nouns and verbs are generated [6]. This set contains 11,347 roots, distributed as follows:


· 115 two-character roots (these roots have no derivations).
· 7,198 three-character roots.
· 3,739 four-character roots.
· 295 five-character roots.

These roots join with various vowel patterns to form simple nouns and verbs, to which affixes can be attached for more complicated derivations. Patterns play an important role in Arabic lexicography and morphology. For example, the root “لعب” corresponds to the pattern “فعل”, and other letters can be added to form another pattern. For instance, the pattern “مفعل” forms the word “ملعب” by adding the letter “م” to the morpheme “لعب”.
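To make this concrete, here is a minimal illustrative sketch (not taken from any particular stemmer or morphological analyzer) that applies a pattern template to a three-character root by substituting the root letters for the placeholder letters ف, ع and ل:

```python
# Illustrative sketch: derive a surface form by mapping a triliteral root
# onto a morphological pattern (template). The pattern uses the placeholder
# letters ف, ع, ل for the 1st, 2nd and 3rd root consonants respectively.

def apply_pattern(root: str, pattern: str) -> str:
    """Substitute the three root letters into the pattern template."""
    assert len(root) == 3, "this sketch only handles three-character roots"
    mapping = {"ف": root[0], "ع": root[1], "ل": root[2]}
    return "".join(mapping.get(ch, ch) for ch in pattern)

if __name__ == "__main__":
    root = "لعب"                       # the root discussed in the text
    print(apply_pattern(root, "فعل"))    # -> لعب
    print(apply_pattern(root, "مفعل"))   # -> ملعب
    print(apply_pattern(root, "مفاعل"))  # -> ملاعب (the plural pattern from Table 1)
```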

Dual and plural forms are composed by adding suffixes such as “ان” and “ين” for the dual, and “ون”, “ين” and “ات” for the plural. There are also irregular plural forms, called broken plurals, in which a noun takes a morphological form in the plural that differs from its singular form (Table 1 shows singular and plural patterns). In some cases a plural pattern may correspond to more than one singular pattern, as shown in Table 1. For example, the word “أطباء” has the singular “طبيب”, which follows the pattern “فعيل”, whereas the word “عقلاء” has the singular “عاقل”, which follows the pattern “فاعل”.

Table 1. Singular and plural patterns.

Plural Pattern | Singular Pattern
مفاعل | مفعل
مفاعيل | مفعول
أفعال | فعل
فعلاء | فعل، فعال، فاعل، فعيل
فعال | فاعل
أفعل | فعل
أفعلة | فعيل، فعال
فواعل | فوعل، فاعل


Some broken plural forms, such as “فعائل”, “فعالل” and “فعلة”, have many singular patterns. There are no general rules that cover them all, and it is hard to handle them without a complete dictionary of these plurals [10].


V. Arabic WordNet

Arabic WordNet (AWN) is based on the design and contents of the Princeton WordNet (PWN). The success of PWN for English has motivated similar projects that aim to develop wordnets for other languages. The AWN project aimed to develop a linguistic resource with a deep formal semantic foundation in order to capture the richness of Arabic.

The Arabic WordNet has been built following the methods developed for EuroWordNet. In addition, word meanings are defined with a machine-understandable semantics in first-order logic. The basis for this semantics is the Suggested Upper Merged Ontology (SUMO) and its associated domain ontologies; SUMO is being enlarged to provide a formal semantic foundation for AWN.

The AWN project provides a deep semantic underpinning for each concept, following the approach that was previously used to map all of PWN to a formal ontology [7][8] (Figure 1).







Figure 1: SUMO mapping to wordnets


AWN is not designed to take Arabic text directly as input. Arabic text must first be transliterated before it can be fed to the system, and the results must be transliterated back into Arabic to be readable [9]. This transliteration scheme was introduced by Buckwalter (2002); for example, the letter “و” is transliterated to “w”, and the letter “ؤ” to “&” (Table 2 lists further examples).
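The following minimal Python sketch is an illustration only: it covers just a handful of the mappings listed in Table 2 rather than the full Buckwalter scheme, and shows how such a character-level transliteration can be applied in both directions:

```python
# Illustrative partial Buckwalter-style transliteration (only a few mappings;
# the full scheme covers the whole Arabic alphabet and all diacritics).
AR_TO_BW = {
    "ب": "b",       # baa
    "ع": "E",       # ain
    "ف": "f",       # faa
    "و": "w",       # waw
    "ؤ": "&",       # hamza on waw
    "ى": "Y",       # alef maqsura
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
    "\u0652": "o",  # sukun
}
BW_TO_AR = {v: k for k, v in AR_TO_BW.items()}

def to_buckwalter(text: str) -> str:
    return "".join(AR_TO_BW.get(ch, ch) for ch in text)

def from_buckwalter(text: str) -> str:
    return "".join(BW_TO_AR.get(ch, ch) for ch in text)

if __name__ == "__main__":
    print(to_buckwalter("عوف"))   # -> Ewf
    print(from_buckwalter("Ewf"))  # -> عوف
```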

VI. Collaborative Filtering


Collaborative filtering (CF) is a popular recommendation algorithm that bases its predictions and recommendations on the ratings or behavior of other users in the system. The fundamental assumption behind this method is that other users' opinions can be selected and aggregated in such a way as to provide a reasonable prediction of the active user's preference. Intuitively, it assumes that if users agree about the quality or relevance of some items, then they will likely agree about other items: if a group of users likes the same things as Mary, then Mary is likely to like the things they like which she hasn't yet seen [11].

There are other methods for performing recommendation, such as finding items similar to the items liked by a user using textual similarity in metadata (content-based filtering or CBF).

The majority of collaborative filtering algorithms in service today operate by first generating predictions of the user's preferences and then producing recommendations by ranking candidate items by predicted preference [14].

Often this prediction is on the same scale as the ratings provided by users, but occasionally the prediction is on a different scale and is meaningful only for candidate ranking. This strategy is analogous to the common information retrieval method of producing relevance scores for each document in a corpus with respect to a particular query and presenting the top-scored items. Indeed, the recommendation task can be viewed as an information retrieval problem in which the domain of items (the corpus) is queried with the user's preference profile [15].

Table 2. Some Arabic letters and diacritics with their Buckwalter transliterations.

Arabic letter or diacritic | Buckwalter
الشدة (shadda) | ~
ألف ممدودة (alef madda) | | (vertical bar)
السكون (sukun) | o
ألف الوصل (alef wasla) | {
حرف الباء (baa) | b
ألف مقصورة (alef maqsura) | Y
حرف العين (ain) | E
الفتحة (fatha) | a
حرف الفاء (faa) | f
الضمة (damma) | u
الكسرة (kasra) | i


 



VII. Computing User Similarity
A critical design decision in implementing user–user CF is the choice of similarity function. Several different similarity functions have been proposed and evaluated in the literature.
Pearson correlation. This method computes the statistical correlation (Pearson's r) between two users' common ratings to determine their similarity. GroupLens and BellCore both used this method [17]. The correlation is computed by the following:
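$$ sim(u,v) = \frac{\sum_{i \in I_u \cap I_v} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I_u \cap I_v} (r_{u,i} - \bar{r}_u)^2}\;\sqrt{\sum_{i \in I_u \cap I_v} (r_{v,i} - \bar{r}_v)^2}} $$

where $r_{u,i}$ is user $u$'s rating of item $i$, $\bar{r}_u$ is user $u$'s mean rating, and the sums run over the items $I_u \cap I_v$ rated by both users (this is the standard form of Pearson's r for user–user similarity).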


Pearson correlation suffers from computing high similarity between users with few ratings in common. This can be alleviated by setting a threshold on the number of co-rated items necessary for full agreement (correlation of 1) and scaling the similarity when the number of co-rated items falls below this threshold [18]. Experiments have shown a threshold value of 50 to be useful in improving prediction accuracy, and the threshold can be applied by multiplying the similarity function by min{|Iu∩Iv|/50,1}.

Cosine similarity. This model is somewhat different from the previously described approaches, as it is a vector-space approach based on linear algebra rather than a statistical approach. Users are represented as |I|-dimensional vectors and similarity is measured by the cosine distance between two rating vectors. This can be computed efficiently by taking their dot product and dividing it by the product of their L2 (Euclidean) norms:
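$$ sim(u,v) = \frac{\mathbf{r}_u \cdot \mathbf{r}_v}{\lVert \mathbf{r}_u \rVert_2 \, \lVert \mathbf{r}_v \rVert_2} = \frac{\sum_i r_{u,i}\, r_{v,i}}{\sqrt{\sum_i r_{u,i}^2}\,\sqrt{\sum_i r_{v,i}^2}} $$

where $\mathbf{r}_u$ and $\mathbf{r}_v$ are the users' rating vectors (the standard vector-space form; unknown ratings are treated as 0, as noted below).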



Unknown ratings are considered to be 0; this causes them to effectively drop out of the numerator. If the user mean baseline is subtracted from the ratings prior to computing the similarity, cosine similarity is equivalent to Pearson correlation when the users have rated the same set of items, and decreases as |Iu∩Iv|²/(|Iu||Iv|) decreases [12][13].
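As an illustrative sketch (not necessarily the implementation used in this work), the two similarity measures and the co-rating threshold described above can be computed from per-user rating dictionaries as follows:

```python
from math import sqrt

# Illustrative user-user similarity functions. Ratings are dicts
# mapping item id -> rating, one dict per user.

def pearson(ratings_u, ratings_v, threshold=50):
    """Pearson correlation over co-rated items, damped by the
    significance-weighting factor min(|Iu ∩ Iv| / threshold, 1)."""
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    mean_v = sum(ratings_v.values()) / len(ratings_v)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0
    damping = min(len(common) / threshold, 1.0)
    return damping * num / (den_u * den_v)

def cosine(ratings_u, ratings_v):
    """Cosine similarity; unrated items are implicitly 0, so only
    co-rated items contribute to the dot product."""
    common = set(ratings_u) & set(ratings_v)
    num = sum(ratings_u[i] * ratings_v[i] for i in common)
    den = sqrt(sum(r * r for r in ratings_u.values())) * \
          sqrt(sum(r * r for r in ratings_v.values()))
    return num / den if den else 0.0
```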


VIII. Content-based filtering
 
Recommender systems are a special type of information filtering system. Information filtering deals with the delivery of items, selected from a large collection, that the user is likely to find interesting or useful, and can be seen as a classification task. Based on training data, a user model is induced that enables the filtering system to classify unseen items into a positive class (relevant to the user) or a negative class (irrelevant to the user). The training set consists of the items that the user found interesting. These items form training instances that all have an attribute specifying the class of the item, based on either the user's rating or on implicit evidence. Formally, an item is described as a vector X = (x1, x2, ..., xn) of n components [16]. The components can have binary, nominal or numerical attributes and are derived from either the content of the items or from information about the users' preferences. The task of the learning method is to select, based on a training set of m input vectors, a function h(X) that can classify any item in the collection. The function h(X) will either classify an unseen item as positive or negative directly by returning a binary value, or return a numerical value; in the latter case a threshold can be used to determine whether the item is relevant or irrelevant to the user.
A content-based filtering system selects items based on the correlation between the content of the items and the user's preferences, as opposed to a collaborative filtering system, which chooses items based on the correlation between people with similar preferences. PRES is a content-based filtering system. It makes recommendations by comparing a user profile with the content of each document in the collection. The content of a document can be represented with a set of terms. Terms are extracted from documents by running them through a number of parsing steps. First, all HTML tags and stop words (words that occur very often and cannot be used as discriminators) are removed [19]. The remaining words are reduced to their stem by removing prefixes and suffixes [Porter 1980]. For instance, the words “computer”, “computers” and “computing” could all be reduced to “comput”. The user profile is represented with the same terms and is built up by analyzing the content of documents that the user found interesting. Which documents the user found interesting can be determined by using either explicit or implicit feedback. Explicit feedback requires the user to evaluate examined documents on a scale. In implicit feedback the user's interests are inferred by observing the user's actions, which is more convenient for the user but more difficult to implement [20].
There are several ways in which terms can be represented in order to be used as a basis for the learning component. A representation method that is often used is the vector space model. In the vector space model a document D is represented as an m-dimensional vector, where each dimension corresponds to a distinct term and m is the total number of terms used in the collection of documents. The document vector is written as D = (w1, w2, ..., wm), where wi is the weight of term ti, indicating its importance. If document D does not contain term ti then weight wi is zero. Term weights can be determined using the tf-idf scheme, in which each term is assigned a weight based on how often it appears in a particular document and how frequently it occurs in the entire document collection; in its standard form [21]:
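$$ w_i = tf_i \cdot \log\!\left(\frac{n}{df_i}\right) $$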


where tfi is the number of occurrences of term ti in document D, n is the total number of documents in the collection, and dfi is the number of documents in which term ti appears at least once. The assumptions behind tf-idf are based on two characteristics of text documents. First, the more times a term appears in a document, the more relevant it is to the topic of the document. Second, the more times a term occurs across all documents in the collection, the more poorly it discriminates between documents [22].
In the vector space model, user profiles can be represented just like documents by one or more profile vectors. The degree of similarity between a profile vector P = (u1, u2, ..., um) and a document vector D can be determined by using the cosine measure:
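$$ sim(P, D) = \frac{\sum_{i=1}^{m} u_i\, w_i}{\sqrt{\sum_{i=1}^{m} u_i^2}\,\sqrt{\sum_{i=1}^{m} w_i^2}} $$

(the standard vector-space cosine, with ui the profile weights and wi the document weights).

As an illustrative sketch (not the PRES implementation itself), the tf-idf weighting and cosine matching described in this section can be combined as follows, assuming documents have already been tokenized and stemmed:

```python
from collections import Counter
from math import log, sqrt

# Illustrative tf-idf weighting and profile matching for content-based
# filtering. Documents are lists of (already stemmed) terms.

def tfidf_vectors(docs):
    """Return one {term: weight} vector per document using w = tf * log(n/df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * log(n / df[t]) for t in tf})
    return vectors

def cosine_sim(p, d):
    """Cosine similarity between two sparse {term: weight} vectors."""
    num = sum(w * d.get(t, 0.0) for t, w in p.items())
    den = sqrt(sum(w * w for w in p.values())) * sqrt(sum(w * w for w in d.values()))
    return num / den if den else 0.0

if __name__ == "__main__":
    docs = [["شعر", "أدب", "قصيدة"], ["تاريخ", "أدب"], ["موسيقى", "طرب"]]
    vecs = tfidf_vectors(docs)
    profile = vecs[0]  # pretend the user profile was built from the first document
    ranked = sorted(range(len(docs)), key=lambda j: cosine_sim(profile, vecs[j]), reverse=True)
    print(ranked)
```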

IX. ISRI Stemmer
 

In 2005, Kazem Taghva et al. proposed an algorithm for Arabic stemming that shares many features with the Khoja stemmer.

The Information Science Research Institute's (ISRI) stemmer [9] uses a similar approach to word rooting as the Khoja stemmer, but does not employ a root dictionary for lookup. Additionally, if a word cannot be rooted, the ISRI stemmer normalizes it and returns the normalized form (for example, removing certain determinants and end patterns) instead of leaving the word unchanged.

The ISRI stemmer follows this procedure (a brief usage sketch follows the list):

1. Remove diacritics representing vowelization.
2. Remove stopwords, punctuation, and numbers.
3. Remove the definite article “ال”.
4. Remove the inseparable conjunction “و”.
5. Remove suffixes.
6. Remove prefixes.
7. Match the result against a list of patterns. If a match is found, extract the characters in the pattern representing the root.
8. Match the extracted root against a list of known "valid" roots.
9. Replace weak letters “أ، و، ي” with “و”.
10. Replace all occurrences of hamza “ؤ، ئ، ء” with “أ”.
11. Check two-letter roots to see if they should contain a double character. If so, add the character to the root.
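For illustration, a ready-made implementation of this stemmer ships with NLTK; a minimal usage sketch (assuming NLTK is installed) looks like this:

```python
# Minimal usage sketch of the ISRI stemmer implementation shipped with NLTK.
# pip install nltk
from nltk.stem.isri import ISRIStemmer

stemmer = ISRIStemmer()

words = ["الملاعب", "واللاعبون", "يلعبون"]
for word in words:
    # stem() applies normalization, affix removal and pattern matching,
    # returning the extracted root (or a normalized form if no root is found).
    print(word, "->", stemmer.stem(word))
```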


Figure 2 shows the flowchart of the ISRI stemmer.


The stemmer defines sets of diacritical marks and affix classes, as shown in Table 3; these sets of marks are removed by the stemmer. In addition, it defines several pattern sets, as shown in Table 4.


Table 3. Affix Sets



Table 4. Arabic Patterns and Roots




Additional stemming proceeds in the following steps:

1. Remove diacritics representing vowels.
2. Normalize the hamza, which appears in several distinct forms in combination with various letters, to the single form “أ”.
3. Remove length-three and then length-two prefixes.
4. Remove the connector “_” if it precedes a word beginning with “و”.
5. Normalize “إ، أ، آ” to “ا”.
6. Return the stem if its length is less than or equal to three.
7. Consider four cases depending on the length of the word:

a) Length = 4: If the word matches one of the patterns from PR4 (Table 4), extract the relevant stem and return it. Otherwise, attempt to remove length-one suffixes and prefixes from S1 and P1, in that order, to get a word not less than three letters in length.

b) Length = 5: Extract length-three stems for words that match patterns from PR53; if a match is found, the relevant length-three stem is returned. Otherwise, attempt to remove suffixes and prefixes. If the word is still five characters in length, it is matched against PR54 to determine whether it contains a stem of length four; the relevant stem is returned if found [23].

c) Length = 6: Extract stems of length three if the word matches a pattern from PR63. Otherwise, attempt to remove suffixes; if a suffix is removed and the resulting term is five letters long, send it back through step 7b. Otherwise, attempt to remove one-character prefixes; if successful, send the resulting length-five term to step 7b.

d) Length = 7: Attempt to remove one-character suffixes and prefixes. If successful, send the resulting length-six term to step 7c.

The ISRI stemmer has been shown to give good improvements on language tasks such as document clustering, compared to a non-stemmed approach [24].

X. Hybrid System

We consider a hybrid system containing content-based filtering and collaborative filtering, in the hope of combining the best of both worlds. In our hybrid approach, we view Arabic cultural topics as a source of information for locating the nearest neighbors of users and items in collaborative filtering: instead of only looking at the overlap in items that two users have in common when calculating user similarities, we can use the overlap in the cultural topics applied to items to determine the most similar neighbors. Users that describe their profile items using the same terminology are likely to share the same interests, making them a good source of recommendations [25].

This is similar to the way we used the tag clouds of users and items to calculate similarity between users and items. The user and item similarities we derive in this way are then plugged into the Collaborative filtering algorithms. The resulting algorithm is a feature-augmented hybrid of Collaborative filtering and content-based filtering.

Our hybrid system consists of two steps: (1) calculating the most similar documents for the active user by using content-based filtering, and (2) using collaborative filtering over the neighbors who selected the same topic as the active user to predict item ratings. The prediction step is performed as described in Figure 3.

Figure 3. The hybrid system.


We use the ISRI algorithm to root the words and store them in an inverted index (Figure 4).
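A small sketch of this indexing step (again illustrative, using NLTK's ISRIStemmer) is shown below:

```python
from collections import defaultdict
from nltk.stem.isri import ISRIStemmer

# Build an inverted index mapping each extracted root to the documents
# (by id) in which it occurs.
def build_inverted_index(docs):
    """docs: {doc_id: list of Arabic tokens}."""
    stemmer = ISRIStemmer()
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[stemmer.stem(token)].add(doc_id)
    return index
```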


Figure 4. Hybrid system processing.

XI. Results
 
Figure 5 contains the best runs for each of the two individual algorithms, as well as our best hybrid run. When each algorithm is used separately, the results differ along three measures: performance, usage, and improvement.
The hybrid system, which combines CF and CBF, gives more accurate results, especially for Arabic, for which no recommendation system for Arabic topics has existed until now.
In general, we observe that the hybrid approach tends to outperform the individual algorithms on Arabic topics. This improvement is statistically significant for the Arabic data, with an improvement of 20%.








Figure 5. Comparison of the best recommendation runs using collaborative filtering, content-based filtering, and the hybrid system on Arabic topics, showing the percentage difference between our hybrid system and systems using CF and CBF separately.

XII. CONCLUSIONS
In this paper we have presented a range of collaborative and content-based approaches to item recommendation for Arabic Cultural Topics.
Our recommendations were evaluated on realistic Arabic cultural-topic data in which tags bind users and items together. These tags can be used successfully to improve the recommendations of standard nearest-neighbor algorithms, but the benefit depends on the algorithm. For item-based filtering, using tags for calculating item similarity alleviates sparsity and results in better performance. At the user level, however, tags do not offer the same benefits.