airportlooki.blogg.se - Document similarity apache lucene

Lucene implements a variant of the Tf-Idf scoring model. The authoritative document for scoring is found on the Lucene site.

The factors involved in Lucene's scoring algorithm are as follows:

1. tf = term frequency in document = measure of how often a term appears in the document.
2. idf = inverse document frequency = measure of how often the term appears across the index.
3. coord = number of terms in the query that were found in the document.
4. lengthNorm = measure of the importance of a term according to the total number of terms in the field.
5. queryNorm = normalization factor so that queries can be compared.
6. boost (index) = boost of the field at index-time.
7. boost (query) = boost of the field at query-time.

The implementation, implication and rationale of factors 1, 2, 3 and 4 in DefaultSimilarity.java, which is what you get if you don't explicitly specify a similarity, are:

Note: the implication of each factor should be read as "Everything else being equal, ...".

tf
Implication: the more frequently a term occurs in a document, the greater its score.
Rationale: documents which contain more of a term are generally more relevant.

idf
Implementation: log(numDocs/(docFreq+1)) + 1
Implication: the more documents a term occurs in, the lower its score.
Rationale: common terms are less important than uncommon ones.
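
To make the tf and idf factors concrete, here is a minimal sketch in plain Java with no Lucene dependency. The idf method follows the formula quoted in this post, log(numDocs/(docFreq+1)) + 1. The tf method uses the square root of the raw frequency, which is how Lucene's classic DefaultSimilarity is generally documented but is not stated in the text here, so treat it as an assumption. The class name TfIdfSketch, the method names and the simple tf * idf product are hypothetical illustrations; the real Lucene score also involves coord, lengthNorm, queryNorm and boosts.

```java
// Illustrative only: not Lucene's API, just the two formulas in isolation.
final class TfIdfSketch {

    // Term frequency factor. Assumption: sqrt(freq), as in the classic
    // DefaultSimilarity. More occurrences in a document give a higher value.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: log(numDocs / (docFreq + 1)) + 1.
    // The cast to double avoids integer division.
    static double idf(long numDocs, long docFreq) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Simplified per-term weight: ignores coord, norms and boosts.
    static double tfIdf(int freq, long numDocs, long docFreq) {
        return tf(freq) * idf(numDocs, docFreq);
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        // A term found in few documents versus one found in most of them.
        System.out.println("rare term idf:   " + idf(numDocs, 9));
        System.out.println("common term idf: " + idf(numDocs, 899));
        // The same term occurring 4 times in a document.
        System.out.println("rare term weight:   " + tfIdf(4, numDocs, 9));
        System.out.println("common term weight: " + tfIdf(4, numDocs, 899));
    }
}
```

With everything else equal, the rare term gets the larger idf and therefore the larger weight, which is the behavior the implication for idf describes.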