Thursday, July 31, 2014

Vector space model - Wikipedia, the free encyclopedia

Documents and queries are represented as vectors:

d_j = ( w_{1,j}, w_{2,j}, \dotsc, w_{n,j} )
q = ( w_{1,q}, w_{2,q}, \dotsc, w_{n,q} )

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting (see the example below).

The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If the words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus).

Vector operations can be used to compare documents with queries.

Applications[edit]

[Figure: document vector d2 and query vector q drawn in the term space, with the angle θ between them]

Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as the same kind of vector as the documents.

In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself:

  \cos{\theta} = \frac{\mathbf{d_2} \cdot \mathbf{q}}{\left\| \mathbf{d_2} \right\| \left \| \mathbf{q} \right\|}

where \mathbf{d_2} \cdot \mathbf{q} is the intersection (i.e. the dot product) of the document (d2 in the figure) and the query (q in the figure) vectors, \left\| \mathbf{d_2} \right\| is the norm of vector d2, and \left\| \mathbf{q} \right\| is the norm of vector q. The norm of a vector is calculated as:

  \left\| \mathbf{q} \right\| = \sqrt{\sum_{i=1}^n q_i^2}

As all vectors under consideration by this model are elementwise nonnegative, a cosine value of zero means that the query and document vector are orthogonal and have no match (i.e. the query term does not exist in the document being considered). See cosine similarity for further information.
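As a concrete illustration, here is a minimal Python sketch of this cosine computation; the vectors used in the usage lines are hypothetical toy weights, not taken from the article:

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between a document vector d and a query vector q."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # a zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

# Orthogonal vectors (no shared terms) score 0; identical directions score 1.
print(cosine_similarity([1, 0, 2], [0, 3, 0]))  # 0.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # ~1.0
```

Because all weights are nonnegative, the result always lies in [0, 1], matching the orthogonality remark above.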

Example: tf-idf weights[edit]

In the classic vector space model proposed by Salton, Wong and Yang [1] the term-specific weights in the document vectors are products of local and global parameters. The model is known as term frequency-inverse document frequency model. The weight vector for document d is \mathbf{v}_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T, where

  w_{t,d} = \mathrm{tf}_{t,d} \cdot \log{\frac{|D|}{|\{d' \in D \, | \, t \in d'\}|}}

and

  • \mathrm{tf}_{t,d} is term frequency of term t in document d (a local parameter)
  • \log{\frac{|D|}{|\{d' \in D \, | \, t \in d'\}|}} is inverse document frequency (a global parameter). |D| is the total number of documents in the document set; |\{d' \in D \, | \, t \in d'\}| is the number of documents containing the term t.
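The two parameters above can be sketched in Python as follows. The source does not fix a particular tf variant, so this sketch assumes the simplest one, the raw term count; the toy corpus is hypothetical:

```python
import math

def tf_idf(term, doc, corpus):
    """w_{t,d} = tf_{t,d} * log(|D| / |{d' in D : t in d'}|), with raw-count tf."""
    tf = doc.count(term)                      # local parameter: term frequency
    df = sum(1 for d in corpus if term in d)  # number of documents containing term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)    # global parameter: inverse document frequency

corpus = [["the", "cat", "sat"],
          ["the", "dog", "ran"],
          ["the", "cat", "and", "the", "dog"]]

print(tf_idf("cat", corpus[0], corpus))  # tf = 1, idf = log(3/2)
print(tf_idf("the", corpus[0], corpus))  # idf = log(3/3) = 0: term occurs everywhere
```

Note how a term that appears in every document gets weight zero regardless of its local frequency, which is exactly the discriminating effect the global parameter is meant to provide.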

Using the cosine, the similarity between document d_j and query q can be calculated as:

\mathrm{sim}(d_j,q) = \frac{\mathbf{d_j} \cdot \mathbf{q}}{\left\| \mathbf{d_j} \right\| \left \| \mathbf{q} \right\|} = \frac{\sum _{i=1}^N w_{i,j}w_{i,q}}{\sqrt{\sum _{i=1}^N w_{i,j}^2}\sqrt{\sum _{i=1}^N w_{i,q}^2}}
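Putting the pieces together, a small ranking sketch in Python; the weight vectors and document names below are hypothetical, standing in for tf-idf vectors over a 4-term vocabulary:

```python
import math

def sim(w_d, w_q):
    """sim(d, q): cosine of the angle between weight vectors w_d and w_q."""
    num = sum(a * b for a, b in zip(w_d, w_q))
    den = math.sqrt(sum(a * a for a in w_d)) * math.sqrt(sum(b * b for b in w_q))
    return num / den if den else 0.0

# Hypothetical tf-idf weight vectors over a 4-term vocabulary.
docs = {"d1": [0.0, 0.7, 0.0, 0.3],
        "d2": [0.5, 0.5, 0.0, 0.0],
        "d3": [0.0, 0.0, 0.9, 0.1]}
query = [0.0, 1.0, 0.0, 0.0]  # query consisting of the second term only

# Relevance ranking: sort documents by decreasing cosine similarity.
for name, w in sorted(docs.items(), key=lambda kv: -sim(kv[1], query)):
    print(name, round(sim(w, query), 3))
```

Documents whose weight mass concentrates on the query terms rank highest; a document sharing no terms with the query (d3 here) scores exactly zero.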

Read full article from Vector space model - Wikipedia, the free encyclopedia

Norm (mathematics) - Wikipedia, the free encyclopedia

A norm is a function that assigns a strictly positive length or size to each vector in a vector space, other than the zero vector (which has zero length assigned to it). A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero vector).
A norm must also satisfy certain properties pertaining to scalability and additivity which are given in the formal definition below.
A simple example is the 2-dimensional Euclidean space R2 equipped with the Euclidean norm. Elements in this vector space (e.g., (3, 7)) are usually drawn as arrows in a 2-dimensional cartesian coordinate system starting at the origin (0, 0). The Euclidean norm assigns to each vector the length of its arrow. Because of this, the Euclidean norm is often known as the magnitude.
A vector space on which a norm is defined is called a normed vector space. Similarly, a vector space with a seminorm is called a seminormed vector space. It is often possible to supply a norm for a given vector space in more than one way.

To give a simple example, the two-dimensional Euclidean space \R^2 has the Euclidean norm. Elements of this vector space (e.g., (3, 7)) are often drawn as arrows starting from the origin in a Cartesian coordinate system; the Euclidean norm of each vector is the length of its arrow.
A vector space equipped with a norm is a normed vector space; likewise, a vector space equipped with a seminorm is a seminormed vector space.

http://mathworld.wolfram.com/Norm.html
The norm of a mathematical object is a quantity that in some (possibly abstract) sense describes the length, size, or extent of the object. Norms exist for complex numbers (the complex modulus, sometimes also called the complex norm or simply "the norm"), Gaussian integers (the same as the complex modulus, but sometimes unfortunately instead defined to be the absolute square), quaternions (quaternion norm), vectors (vector norms), and matrices (matrix norms). A generalization of the absolute value known as the p-adic norm is also defined.
Norms are variously denoted |x|, |x|_p, ||x||, or ||x||_p. In this work, single bars are used to denote the complex modulus, quaternion norm, p-adic norms, and vector norms, while the double bar is reserved for matrix norms.
The term "norm" is often used without additional qualification to refer to a particular type of norm (such as a matrix norm or vector norm). Most commonly, the unqualified term "norm" refers to the flavor of vector norm technically known as the L2-norm. This norm is variously denoted ||x||_2||x||, or |x|, and gives the length of an n-vector x=(x_1,x_2,...,x_n). It can be computed as
 |x|=sqrt(x_1^2+x_2^2+...+x_n^2).
The norm of a complex number, 2-norm of a vector, or 2-norm of a (numeric) matrix is returned by Norm[expr]. Furthermore, the generalized p-norm of a vector or (numeric) matrix is returned by Norm[expr, p].
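The generalized p-norm described above can be sketched in a few lines of Python (the example vector (3, 7) echoes the one used earlier for the Euclidean norm):

```python
import math

def p_norm(x, p=2):
    """Generalized p-norm: ||x||_p = (sum |x_i|^p)^(1/p); p = inf gives max |x_i|."""
    if p == math.inf:
        return max(abs(xi) for xi in x)  # limiting case: the max-norm
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3, 7]
print(p_norm(x))            # L2 norm: sqrt(3^2 + 7^2) = sqrt(58)
print(p_norm(x, 1))         # L1 norm: |3| + |7| = 10
print(p_norm(x, math.inf))  # max-norm: 7
```

The default p = 2 reproduces the unqualified "norm" (the L2-norm) from the formula above.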
Read full article from Norm (mathematics) - Wikipedia, the free encyclopedia

Tuesday, July 29, 2014

Shannon's entropy | planetmath.org

Let X be a discrete random variable on a finite set \mathcal{X} = \{x_1, \dotsc, x_n\}, with probability distribution function p(x) = \Pr(X = x). The entropy H(X) of X is defined as

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x). \quad (1)

The convention 0log0=0 is adopted in the definition. The logarithm is usually taken to the base 2, in which case the entropy is measured in "bits," or to the base e, in which case H(X) is measured in "nats."
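A minimal Python sketch of definition (1), including the 0 log 0 = 0 convention; the distributions in the usage lines are hypothetical:

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum p(x) log_b p(x), with the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin carries exactly 1 bit of entropy.
print(entropy([0.5, 0.5]))          # 1.0 (bits)
# The same distribution measured in nats (base e).
print(entropy([0.5, 0.5], math.e))  # log 2 ≈ 0.693 (nats)
# A certain outcome carries no information.
print(entropy([1.0, 0.0]))          # 0.0
```

The `if p > 0` filter implements the 0 log 0 = 0 convention without special-casing.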

If X and Y are random variables on \mathcal{X} and \mathcal{Y} respectively, the joint entropy of X and Y is

H(X,Y) = -\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p(x,y) \log_b p(x,y),

where p(x,y) denotes the joint distribution of X and Y.
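The joint entropy can be sketched the same way; the two-coin joint distribution below is a hypothetical example, chosen because for independent X and Y the identity H(X,Y) = H(X) + H(Y) holds and gives an easy check:

```python
import math

def joint_entropy(p_xy, base=2):
    """H(X,Y) = -sum over (x,y) of p(x,y) log_b p(x,y); p_xy maps pairs to probabilities."""
    return -sum(p * math.log(p, base) for p in p_xy.values() if p > 0)

# Two independent fair coins: p(x, y) = p(x) * p(y) = 1/4 for each of the 4 pairs.
p = {(x, y): 0.25 for x in "HT" for y in "HT"}
print(joint_entropy(p))  # 2.0 bits = H(X) + H(Y), since X and Y are independent
```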


Read full article from Shannon's entropy | planetmath.org

random variable | planetmath.org

If (\Omega, \mathcal{A}, P) is a probability space, then a random variable on \Omega is a measurable function X: (\Omega, \mathcal{A}) \to S to a measurable space S (frequently taken to be the real numbers with the standard measure). The law of a random variable is the probability measure P X^{-1}: S \to \mathbb{R} defined by P X^{-1}(s) = P(X^{-1}(s)).

A random variable X is said to be discrete if the set \{X(\omega) : \omega \in \Omega\} (i.e. the range of X) is finite or countable. A more general version of this definition is as follows: A random variable X is discrete if there is a countable subset B of the range of X such that P(X \in B) = 1. (Note that, as a countable subset of \mathbb{R}, B is measurable.)

A random variable Y is said to be continuous if it has a cumulative distribution function which is absolutely continuous.


Read full article from random variable | planetmath.org
