Data Mining (89 page)

Read Data Mining Online

Authors: Mehmed Kantardzic

BOOK: Data Mining
3.2Mb size Format: txt, pdf, ePub

Figure 11.2.
Initialization of the HITS algorithm. (a) Subgraph of the linked pages; (b) adjacency matrix A and weight vectors for the given graph.

The first iteration of the HITS algorithm will give the changes in the a and h vectors:

Even this single iteration of the HITS algorithm shows that, in the given set of documents, document-5 has the most authority and document-1 is the best hub. Additional iterations will correct the weight factors for both vectors, but the obtained ranking of authorities and hubs will stay unchanged for this example.

The continuous growth in the size and use of the Internet creates difficulties in the search for information. Resource discovery is frustrating and inefficient when simple keyword searches can convey hundreds of thousands of documents as results. Many of them are irrelevant pages, some of them may have been moved, and some abandoned. While the first Web-mining algorithm HITS is primarily based on static information describing the Web-site structure, the second one, LOGSOM, uses dynamic information about a user’s behavior. LOGSOM is a sophisticated method that organizes the layout of the information in a user-interpretable graphic form. The LOGSOM system uses self-organizing maps (SOM) to organize Web pages into a two-dimensional (2-D) table, according to users’ navigation patterns. The system organizes Web pages according to the interest of Web users by keeping track of their navigation paths.

The SOM technique is used as the most appropriate technique for the problem of Web-page organization because of its strength not only in grouping data points into clusters, but also in graphically representing the relationship among clusters. The system starts with a Web-log file indicating the date, time, and address of the requested Web pages as well as the IP address of the user’s machine. The data are grouped into meaningful transactions or sessions, where a transaction is defined by a set of user-requested Web pages. We assume that there is a finite set of unique URLs:

and a finite set of m user transactions:

Transactions are represented as a vector with binary values u
i
:

where

Preprocessed log files can be represented as a binary matrix. One example is given in Table
11.1
.

TABLE 11.1.
Transactions Described by a Set of URLs

Since the dimensions of a table (n × m) for real-world applications would be very large, especially as input data to SOM, a reduction is necessary. By using the K-means clustering algorithm, it is possible to cluster transactions into prespecified number k (k
m) of transaction groups. An example of a transformed table with a new, reduced data set is represented in Table
11.2
, where the elements in the rows represent the total number of times a group accessed a particular URL (the form of the table and values are only one illustration, and they are not directly connected with the values in Table
11.1
).

TABLE 11.2.
Representing URLs as Vectors of Transaction Group Activity

The new, reduced table is the input for SOM processing. Details about the application of SOM as a clustering technique and the settings of their parameters are given in the previous chapter. We will explain only the final results and their interpretation in terms of Web-page analysis. Each URL will be mapped onto a SOM based on its similarity with other URLs in terms of user usage or, more precisely, according to users’ navigation patterns (transaction group “weights” in Table
11.2
). Suppose that the SOM is a 2-D map with p × p nodes, where p × p ≥ n, then a typical result of SOM processing is given in Table
11.3
. The dimensions and values in the table are not the results of any computation with values in Tables
11.1
and
11.2
, but a typical illustration of the SOM’s final presentation.

TABLE 11.3.
A Typical SOM Generated by the Description of URLs

Other books

To Crush the Moon by Wil McCarthy
One Coffee With by Margaret Maron
Devil's Business by Kittredge, Caitlin
We Are All Made of Stars by Rowan Coleman
Qumrán 1 by Eliette Abécassis