Data Mining (94 page)

Authors: Mehmed Kantardzic

BOOK: Data Mining

6Mb size Format: txt, pdf, ePub

Figure 11.8.
Singular value decomposition of initial data.

With this initial step, we have already reduced the number of dimensions representing each document from 10 to four. We now show how one would go about further reducing the number of dimensions. When the data set contains many documents, the previous step is not enough to reduce the number of dimensions meaningfully. To perform further reduction, we first observe that matrix
S
provides us with a diagonal matrix of eigenvalues in descending order as follows: λ
₁, … , λ
₄= {3.869, 2.344, 1.758, 0.667}. We will be able to capture most of the variability of the data by retaining only the first k eigenvalues rather than all n terms (in our matrix S, n = 4). If for example k = 2, then we retain
or 85% of the variability when we move from the four new dimensions per document to only two. The rank 2 approximations can be seen in Figure
11.9
. This rank 2 approximation is achieved by selecting the first k columns from
U
, the upper left k × k matrix from
S,
and the top k rows from
V
^Tas shown in Figure
11.9
.

Figure 11.9.
Rank 2 approximation of the singular value decomposition.

The rank k approximation to
V
^Tgives us our dimensionally reduced data set where each document is now described with only two dimensions.
V
^Tcan be thought of as a transformation of the original document matrix and can be used for any of a number of text-mining tasks, such as classification and clustering. Improved results may be obtained using the newly derived dimensions, comparing against the data-mining tasks using the original word counts.

In most text-mining cases, all training documents (in our case 4) would have been included in matrix A and transformed together. Document five (d
₅) was intentionally left out of these calculations to demonstrate a technique called “folding-in,” which allows for a small number of documents to be transformed into the new reduced space and compared with a training document database. Matrices from our previous SVD calculations are used for this transformation. This is done using the following modified formula:
. This equation is a rearrangement of terms from the initial SVD equation replacing the original data set, matrix A, with the term counts for document five (d
₅) as matrix A’. The resulting multiplication can be seen in Figure
11.10
. The result is the document d
₅represented in the reduced 2-D space, d
₅= [−0.172, 0.025].

Figure 11.10.
SVD calculations to produce a 2-D approximation for d
₅.

To visualize the transformed data, we have plotted our example documents, d
_1..5, in Figure
11.11
. We now perform the nearest neighbor classification algorithm. If the task is to disambiguate the term “bank,” and the possible classes are “financial institution” and “basketball shot,” then a review of the original text reveals that documents d
₁, d
₂, and d
₅are of the class “financial institution” and documents d
₃and d
₄are of the class “basketball shot.” To classify test document d
₅, we compare d
₅with each other document. The closest neighbor will determine the class of d
₅. Ideally, d
₅should be nearest to d
₁and d
₂, and farthest from d
₃and d
₄, as pointed earlier. Table
11.4
shows these calculations using the Euclidean distance metric. The assessment for which document is closest is made based on both the original 10 dimensions and the reduced set of two dimensions. Table
11.5
shows the same comparison using cosine similarity to compare documents. Cosine similarity is often used in text-mining tasks for document comparisons.

TABLE 11.4.
Use of Euclidean Distance to Find Nearest Neighbor to d
₅in Both 2-d and 10-d (the Smallest Distance Ranks First)

TABLE 11.5.
Use of Cosine Similarity to Find Most Similar Document to d
₅in Both 2-d and 10-d (the Largest Similarity Ranks First)

Figure 11.11.
2-D plot of documents and query using LSA.

Euclidean distance as shown in Table
11.4
when using 10 dimensions ranks d
₃and d
₄above d
₁and d
₂. By the formulation of Euclidean distance, when two documents both share a term or both do not share a term, the result is a distance of 0 for that dimension. After applying the LSA transformation, the Euclidean distance ranks d
₁above d
₃and d
₄. However, document d
₂only is ranked above d
₃and not d
₄.

Cosine similarity calculates the cosine of the angle between two vectors representing points in n-dimensional space. The result is a similarity score between 0 and 1, where a 1 shows highest similarity, and 0 shows no similarity. For document vector comparisons, no additional strength of similarity is added by two vectors containing a 0 for a specific term. This is a very beneficial characteristic for textual applications. When we consider Table
11.5
we see that without the transformation, in 10-D space, d
₂is ranked above d
₃and d
₄for containing both terms in d
₅. However, d
₁is ranked below d
₃and d
₄for having more terms than these other sentences. After the LSA transformation, d
₁and d
₂are ranked above d
₃and d
₄, providing the expected ranking. In this simple example the best results occurred when we first transformed the data using LSA and then used cosine similarity to measure the similarity between initial training documents in a database and a new document for comparison. Using the nearest neighbor classifier, d
₅would be classified correctly as a “financial institution” document using cosine similarity for both the 10-d and 2-d cases or using Euclidean distance for the 2-d case. Euclidean distance in the 10-d case would have classified d
₅incorrectly. If we used k-nearest neighbor with k = 3, then Euclidean distance in the 10-d case would have also incorrectly classified d
₅. Clearly, the LSA transformation affects the results of document comparisons, even in this very simple example. Results are better because LSA enable better representation of a document’s semantics.

Other books

Hot for His Hostage by Angel Payne

Nickolai's Noel by Alicia Hunter Pace

Echoes of the Heart by Alyssa J. Montgomery

The Edge of Normal by Carla Norton

Final Demand by Deborah Moggach

Los hijos de los Jedi by Barbara Hambly

The Bomber Dog by Megan Rix

A Pig of Cold Poison by Pat McIntosh

The Art Of The Next Best by Deborah Nam-Krane

Shameless Exposure by Fanshaw, Robert