This paper surveys techniques for analyzing web pages, and text in general, in order to outline effective strategies for junk filtering. It examines the effectiveness of traditional and modern junk classifiers to establish the current state of the art, and identifies the basic attributes of a web page that mark it as spam. The paper closes with some suggested techniques and a general recommendation.
The World Wide Web has proved to be an effective means of acquiring large numbers of texts for this purpose. By now, however, the amount of junk content on the Internet has multiplied to the point where users find it difficult to locate the texts they need. This is the role of a junk filter, which should screen out all unwanted content. Quite a number of studies to date have focused on the problem of separating junk reading material from the huge volume of articles and news items spread across millions of websites.
Modern Text Analysis
Most approaches to classifying texts on the web have been based on assessing the grade or reading level of the texts. According to Schwarm and Ostendorf, many traditional techniques for assessing reading level rely on simple approximations of grammatical or syntactic complexity, such as sentence length. For instance, the widely used Flesch-Kincaid Grade Level index examines the number of syllables per word and the average sentence length of a given text. Likewise, the Gunning Fog index is based on "the average number of words per sentence and the percentage of words with three or more syllables" (Reading Level 524). Although these methods are quick and easy to compute, they raise a number of technical problems: syntactic complexity cannot be determined merely by measuring sentence length, and counting the syllables in a word does not necessarily reveal its difficulty. Other methods take a semantic approach to readability, measuring the frequency of a word in a text relative to a reference corpus; the Dale-Chall formula is one example, and the Lexile measure is another. Moreover, recent university research has produced further tools, such as a "smoothed unigram" classifier, with better results. The most commonly used statistical language model in Natural Language Processing (NLP) for classifying text is the n-gram model. Support Vector Machines (SVMs) have also been used to classify texts; the SVM detectors employed the following features:
- average sentence length;
- average number of syllables per word;
- Flesch-Kincaid score;
- out-of-vocabulary rate score;
- parse features.
The result was that the trigram model was considerably more accurate than the bigram or unigram models.
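As a concrete illustration (not taken from the cited papers), the two surface-level indices above can be sketched in a few lines of Python; the syllable counter is a crude vowel-group heuristic, which is exactly the kind of approximation the text criticizes:

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable count: runs of vowels, at least one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability_scores(text: str) -> dict:
    """Flesch-Kincaid Grade Level and Gunning Fog index for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)      # average words per sentence
    spw = syllables / len(words)           # average syllables per word
    # fraction of "complex" words (three or more syllables)
    complex_frac = sum(1 for w in words if count_syllables(w) >= 3) / len(words)
    return {
        # Flesch-Kincaid Grade Level: 0.39*wps + 11.8*spw - 15.59
        "flesch_kincaid": 0.39 * wps + 11.8 * spw - 15.59,
        # Gunning Fog: 0.4 * (wps + 100 * complex-word fraction)
        "gunning_fog": 0.4 * (wps + 100 * complex_frac),
    }
```

Both indices reduce to sentence length and syllable counts, which is why a page of short, obscure words can score as "easy."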
In her 2007 doctoral dissertation, Petersen employed a number of methods, both those mentioned above and several others, to weigh different classification techniques and to develop new NLP-based classifiers. Regarding the assessment of web pages, Petersen and Ostendorf ("Text Simplification") describe the experiments they performed to filter web pages. They retrieved pages for selected topics from the Google search engine and applied simple heuristic filters to clean the text. They stripped the HTML tags and kept only contiguous chunks of text consisting of sentences whose out-of-vocabulary (OOV) rate was lower than fifty percent, relative to a general-purpose vocabulary of 36,000 English words. Next, they discarded texts "without at least 50 words remaining and applied the reading level detectors described in the previous section to the remaining texts" (2). Eventually, the sorted web pages were handed over to expert annotators, who judged whether or not the pages met the required level of selection for a specific set of students. The researchers then made some changes based on the annotators' observations. They again used heuristics to filter web pages: they continued to discard sentences with an OOV rate higher than fifty percent "but no longer stipulate that all remaining sentences in an article must be from a contiguous chunk" (2). They also filtered out lines of fewer than five words each; in this way, chunks such as menu bar items, titles, and other junk not part of the original text were wiped out. However, the heuristics removed only a small amount of the unwanted junk, so they then designed a Naïve Bayes classifier to distinguish content from junk, and obtained better results by applying it before the reading level detectors.
In their view, this preliminary step is essential; otherwise the returned content would still carry junk. Stricter topic filtering is another measure needed for better filtering.
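A minimal sketch of those heuristics might look as follows; the vocabulary set is a hypothetical stand-in for the 36,000-word list, and the thresholds mirror the ones described above:

```python
def filter_page(lines, vocabulary, max_oov=0.5,
                min_words_line=5, min_words_total=50):
    """Heuristic junk filter in the spirit of Petersen & Ostendorf:
    drop short lines (menu items, titles) and lines with a high
    out-of-vocabulary (OOV) rate, then discard pages left with too
    little text."""
    kept = []
    for line in lines:
        words = line.lower().split()
        if len(words) < min_words_line:
            continue  # menu bar items, titles, other boilerplate
        oov = sum(1 for w in words if w not in vocabulary) / len(words)
        if oov <= max_oov:
            kept.append(line)
    total = sum(len(l.split()) for l in kept)
    # a page with too little surviving text is rejected outright
    return kept if total >= min_words_total else []
```

Note that, as in the revised experiment, surviving lines are not required to form one contiguous chunk.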
Another direction for filtering results and identifying junk in web material can also be taken. The examples above mostly analyze the syntax of the retrieved material. A distinction should be drawn between junk defined relative to the query (the "usefulness" of a result) and results unrelated to the query altogether, i.e., web spam.
Since users rarely look beyond the first pages of search results, web spammers inflate the ranking of their junk sites so that they appear among the first result batches. A classifier that separates spam from non-spam content should therefore be applied first, before the content is classified against the query.
Spam Based on Content
The following content-based factors can indicate that a page is spam:
- The number of words on the page. (Spammers often overload a page with many irrelevant words so that it matches as many queries as possible.)
- The number of words in the page title. (Similarly, since search engines consider the presence of query words in the title, stuffing the title with words lets the page match a larger number of queries.)
- The average word length. (Spammers exploit long words that search users commonly misspell; the presence of unusually long words, often two or three words with the spaces omitted, increases the probability that the page is spam.)
- The amount of anchor text. (Anchor text on a page serves as links to other pages, so a page may match a query yet consist mainly of links pointing elsewhere.)
- The fraction of visible content. (Some junk pages are made merely to be indexed by search engines; their visible text carries little useful information, and the engine may judge the page's nature and relevance from comments in the body or meta tags in the header. The ratio of non-markup content to total page size can therefore be used to classify a page as junk or not.)
- The compression ratio. (The ratio between the uncompressed and the compressed page size usually indicates duplication within the page: repeating content raises the frequency of the query word, while compression stores only one copy and represents the rest as references.) (Perkins)
The factors above can be supplied to a classifier as training data to identify junk based on appearance and content. It should be noted that most of these factors give poor results when used alone, but a combination of them, applied to training data, should be satisfactory.
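To make the feature list concrete, here is a rough Python sketch (not from the cited work) that extracts most of these signals from raw HTML using only the standard library; real extractors would be considerably more careful about markup:

```python
import zlib
from html.parser import HTMLParser

class PageFeatures(HTMLParser):
    """Collect page text, title text, and anchor text from raw HTML."""
    def __init__(self):
        super().__init__()
        self.text, self.title, self.anchor = [], [], []
        self._stack = []  # crude open-tag tracking
    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
    def handle_endtag(self, tag):
        if tag in self._stack:
            self._stack.remove(tag)
    def handle_data(self, data):
        self.text.append(data)
        if "title" in self._stack:
            self.title.append(data)
        if "a" in self._stack:
            self.anchor.append(data)

def content_features(html: str) -> dict:
    """The content-based spam signals listed above, as one feature dict."""
    p = PageFeatures()
    p.feed(html)
    words = " ".join(p.text).split()
    raw = html.encode()
    visible = sum(len(w) for w in words)  # non-whitespace visible chars
    return {
        "num_words": len(words),
        "num_title_words": len(" ".join(p.title).split()),
        "avg_word_length": sum(len(w) for w in words) / max(1, len(words)),
        "anchor_fraction": len(" ".join(p.anchor).split()) / max(1, len(words)),
        "visible_fraction": visible / max(1, len(raw)),
        # uncompressed size over compressed size: high for repetitive pages
        "compression_ratio": len(raw) / max(1, len(zlib.compress(raw))),
    }
```

Each key corresponds to one bullet above; the dict can be fed directly to a classifier as a training example.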
Spam Based on Host
The features above concern HTML content filtering. Another useful approach is to train the classifier on features derived from the host, namely:
- the name of the host to be classified;
- its incoming links;
- its outgoing links.
Regarding these attributes, it should be noted that search engines assign high importance to URLs that contain the query words, so the hostname itself can be classified: two different keyword-stuffed hostnames mapping to the same IP address, for example, can indicate spam.
Incoming and outgoing links are analyzed because search engines assign high ranks to pages based on the importance of the pages linking to and from them (Perkins). By analyzing the links under the principle that a spam page does not link to a non-spam page, the artificially inflated search engine rank can be discounted (Fetterly, Manasse and Najork, "Spam, Damn Spam").
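One way to operationalize that link principle is a BadRank-style propagation, sketched below; this is an illustration rather than the method of the cited papers, and the host names are made up. Spam mass flows backwards along links, so a host that points mostly at known spam accumulates a high score:

```python
def spam_scores(links, seed_spam, iterations=10, damping=0.85):
    """links: dict mapping each host to the list of hosts it links to.
    seed_spam: set of hosts already known to be spam.
    A page inherits spam mass from the pages it links TO, reflecting
    the principle that spam pages do not link to non-spam pages."""
    score = {h: (1.0 if h in seed_spam else 0.0) for h in links}
    for _ in range(iterations):
        new = {}
        for host, outs in links.items():
            if host in seed_spam:
                new[host] = 1.0  # seeds stay fully spam
                continue
            # average spam score of the link targets, damped
            inherited = (sum(score.get(o, 0.0) for o in outs) / len(outs)
                         if outs else 0.0)
            new[host] = damping * inherited
        score = new
    return score
```

Hosts whose score stays near zero keep their rank; hosts pulled toward 1.0 by their outgoing links can have their rank discounted.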
In this case we use another naïve Bayes classifier, this time based on the query word. It sorts results into two categories, relevant and irrelevant. Its features and their operation are as follows:
Relevant-Irrelevant Features
| Feature | Distinguishable states | Explanation |
| --- | --- | --- |
| Occurrence of the keyword in the title of the document | 1) 0 occurrences; 2) 1 or more occurrences | The presence or absence of the keyword raises or lowers the relevance accordingly. |
| Occurrence of the keyword in the brief description | 1) 0 occurrences; 2) exactly once | Occurrence of the keyword in the first 25-50 words of the summary raises the relevance; its absence does not affect the relevance. |
| Multiple occurrences of the keyword in the description | 1) fewer than 2 occurrences; 2) 2 or more occurrences | Occurrence of the keyword twice or more raises the relevance of the result. |
| Number of occurrences in the body of the text | 1) fewer than 2 occurrences; 2) 2-7 occurrences; 3) more than 7 occurrences | Repeated occurrence of the keyword in the body raises the relevance (non-linear, by the discrete factors «2», «3», «4», «5», «6», «7 and more»); exactly one occurrence does not change the relevance; absence decreases it. |
| Position of the keyword inside the text | A value in the interval from 0.25 to 1 | The position is represented by a number from 0 (end of the document) to 1 (beginning of the document); the larger the value, the higher the relevance (non-linear, by continuous factor values). The threshold of 0.25 reflects the assumption that keyword appearances within the first 75% of the text are relevant; values below 0.25 do not change the relevance. |
| Number of occurrences of a pair of keywords in the title of the document | 1) 0 occurrences; 2) 1 or more occurrences | A phrase (bigram) in the title coinciding with a two-keyword phrase raises the relevance; its absence does not change the relevance. |
| Number of occurrences of the pair of keywords in the text body | 1) 0 occurrences; 2) from 1 to 4; 3) more than 4 | Occurrence of the bigram one or more times anywhere in the text (description, annotations, etc.) raises the relevance (non-linear, by the discrete factor values «1», «2», «3», «4 and more»); absence does not change the relevance. |
| The value of the TF*IDF factor for the keyword | 1) 0; 2) a value in the interval from 0 to 4; 3) a value greater than 4 | A larger TF*IDF value, which combines the frequency of the keyword (TF) with the weight of the word (IDF), raises the relevance of the document (non-linear, by continuous factor values in the interval from 0 to 4; values of 4 and above are maximal); a value of 0 does not change the relevance. |
| The date of publication (optional; relevant only for some search topics) | A value in the interval from 0 to the maximum number of days | The number of days from publication of the text until the present (today's) date. The larger the value, the lower the relevance (non-linear, by discrete values 1, 2, and so on up to the input maximum); a value of 0 (today) does not change the relevance. |
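The TF*IDF factor in the table can be computed as follows; this is the standard formulation, sketched here for a single keyword against a small reference corpus:

```python
import math

def tf_idf(keyword, document_words, corpus):
    """TF*IDF for one keyword: its frequency in this document (TF)
    times the inverse document frequency over a corpus (IDF), which
    weights rare words more heavily than common ones."""
    tf = document_words.count(keyword) / max(1, len(document_words))
    containing = sum(1 for doc in corpus if keyword in doc)
    idf = math.log(len(corpus) / max(1, containing))
    return tf * idf
```

The raw value can then be discretized into the three states in the table (0, between 0 and 4, above 4).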
One more factor can be added: the ratio of the size of the images on the page to the overall page size, to avoid returning results that consist of, say, four images and two lines of text. A ratio of 1 marks the page as irrelevant and 0 as relevant; in other words, the lower the value, the higher the relevance.
The documents are first classified without regard to their connection to the query word, since this disposes of the pages that appeared in the results only by manipulating the search engines' ranking system. The spam classifier is thus applied first to eliminate all spam pages; then the previously listed features are applied to grade the remaining non-spam pages by their usefulness to the user, that is, to determine their relevance. The features are fed to the classifier as training data, and by repeating this operation, assigning a classifier to each feature, and manually labeling pages as spam or non-spam, higher accuracy can be achieved.
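A minimal naïve Bayes over such discretized features could look like the sketch below; it is a generic implementation with add-one smoothing, and the feature names in the example are illustrative, not prescribed by the table:

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Naive Bayes over discrete feature states, e.g. the
    relevant/irrelevant features tabulated above."""
    def __init__(self):
        self.class_counts = defaultdict(int)
        # (label, feature, state) -> count
        self.feature_counts = defaultdict(int)

    def train(self, examples):
        """examples: iterable of (feature dict, label) pairs."""
        for features, label in examples:
            self.class_counts[label] += 1
            for f, state in features.items():
                self.feature_counts[(label, f, state)] += 1

    def classify(self, features):
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label, n in self.class_counts.items():
            lp = math.log(n / total)  # log prior
            for f, state in features.items():
                # add-one (Laplace) smoothing so unseen states
                # never zero out a class
                count = self.feature_counts[(label, f, state)]
                lp += math.log((count + 1) / (n + 2))
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

In the scheme described above, one such classifier first separates spam from non-spam, and a second, trained on the relevance features, grades the surviving pages against the query.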
The most effective junk filter will not rely on a single factor or feature; a combination of all the factors creates the optimal variant, bearing in mind that it will not work right from the start unless a pilot study is run in which the filter is changed and adapted against manual judgments until it is fully automated. Any classifier can be adapted to individual needs, and in our case a general classifier was modified. It will certainly require adaptation over time, and the creation of new methods will eventually allow us, if not to reach perfection, then at least to step in that direction. The wide availability and accessibility of information today leave us in great need of filtering everything we obtain from the web and screening out the imposed commercial junk.
Petersen, S. E., and M. Ostendorf. "Text Simplification for Language Learners: A Corpus Analysis." University of Washington, Dept. of Computer Science, 2008. Web.
Petersen, S. E., and M. Ostendorf. "Assessing the Reading Level of Web Pages." University of Washington, Dept. of Computer Science, 2005. Web.
Petersen, S. E. "Natural Language Processing Tools for Reading Level Assessment and Text Simplification for Bilingual Education." Diss. University of Washington, 2007. Web.
Schwarm, S. E., and M. Ostendorf. "Reading Level Assessment Using Support Vector Machines and Statistical Language Models." 2005. Web.
Fetterly, D., M. Manasse, and M. Najork. "Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages." Workshop on the Web and Databases, 2004. Web.
Perkins, A. "The Classification of Search Engine Spam." Silverdisc, 2001. Web.
Westbrook, Andrew, and Russel Green. "Using Semantic Analysis to Classify Search Engine Spam." 2008. Web.