Example: A Search Engine
Introduction
Search engines are used in a variety of contexts to locate documents that match or satisfy a query. Queries consist of sequences of search terms -- words that are meant to prescribe the subject matter of matching documents. Examples of search engines include Web search engines and database interfaces. However, search engines can be adapted to search a wide variety of structured or unstructured data.
This example illustrates how a simple search engine, based on the vector space model, can be constructed using Maple. Various Maple packages are used to pre- and post-process the data to be searched, as well as the queries, and the LinearAlgebra package is used to effect the computations necessary to locate relevant documents.
Introduction to Searching
Prototypically, a document is a file of structured or unstructured text, but this section treats documents as abstract data items. A document can be a scientific article in LaTeX format, a web page, an integral formula, an image, or a simple string of text, as used here for illustrative purposes. The document is a raw piece of data that can be identified as a whole. Each document is equipped with a document ID, that is, an identifier that is used as a handle for the document data. The document ID is small, while the document (containing data) may be arbitrarily large.
For searching, documents are organized into corpora. A corpus is a collection of documents that can be searched collectively using search terms. A query is a list of search terms. A search term is typically a (natural language) word or phrase. The purpose of a search is to identify, among documents in a given corpus, those which match the search terms. Thus, search input consists of a corpus to search, and one or more search terms. The output is a ranked list of documents that the search criteria judge to be relevant to the search terms.
For this example, consider the following simple corpus of documents. Each document is a short string of text, which serves as both the document ID and the document data.
doc1 := "The mathematician's patterns, like the painter's " "or the poet's must be beautiful; the ideas, like " "the colors or the words must fit together in a " "harmonious way. Beauty is the first test: there " "is no permanent place in this world " "for ugly mathematics.": # Hardy
doc2 := "God does arithmetic.": # Karl Friedrich Gauss
doc3 := "Anyone who cannot cope with mathematics is not" " fully human. At best he is a tolerable subhuman " "who has learned to wear shoes, bathe, and not make " "messes in the house.": # Robert A. Heinlein
doc4 := "Things should be made as simple as possible, " "but no simpler.": doc5 := "I don't believe in mathematics.": doc6 := "Imagination is more important than knowledge.": doc7 := "The most beautiful thing we can experience is " "the mysterious. It is the source of all true " "art and science.": doc8 := "Common sense is the collection of prejudices " "acquired by age eighteen.": doc9 := "God does not care about our mathematical " "difficulties. He integrates empirically.": # A. Einstein
The corpus consists of these documents. To facilitate searches, the corpus is preprocessed to construct a search index. The index aids searches by making it easy to quickly locate documents in the corpus that are relevant to one or more of the search terms in a query.
Inverted Term Occurrence Indexing
An inverted term occurrence index is a simple search index. An inverted term occurrence index constructs a list of all the potential search terms present in the corpus and maintains a list of the documents in which each search term occurs. Construction of an inverted term occurrence index begins by constructing a document-term mapping. For simple text strings, construct a list of document terms as follows.
DocumentTerms := proc( text::string ) StringTools:-Words( text ) end proc:
DocumentTerms( doc1 );
The,mathematician's,patterns,like,the,painter's,or,the,poet's,must,be,beautiful,the,ideas,like,the,colors,or,the,words,must,fit,together,in,a,harmonious,way,Beauty,is,the,first,test,there,is,no,permanent,place,in,this,world,for,ugly,mathematics
Using this, construct an inverted term occurrence index.
BuildIndex := proc( corpus::list(string) ) local docterms, corpusterms, index, term, doc; use StringTools in # Construct all terms in the corpus docterms := table( [seq]( doc = DocumentTerms( doc ), doc = corpus ) ); corpusterms := `union`( seq( {op}( docterms[ doc ] ), doc = corpus ) ); # Map each term to the documents containing it index := table(); for doc in corpus do for term in docterms[ doc ] do if assigned( index[ term ] ) then index[ term ] := index[ term ] union { doc }; else index[ term ] := { doc }; end if; end do; end do; end use; # Return the table eval( index, 1 ); end proc:
Index := BuildIndex( [ doc || ($1..9) ] ):
nops( {indices}( Index ) );
104
Searching is simple, using this index. Search for the terms "mathematical" and "beauty".
Search := proc( query::list(string) ) global Index; local results, term; results := {}; for term in query do if assigned( Index[ term ] ) then results := results union Index[ term ]; end if; end do; results; end proc:
Search( [ "mathematical", "beauty" ] );
God does not care about our mathematical difficulties. He integrates empirically.
nops( (3) );
1
There are several problems with this index. One problem is that the index is quite large (relative to the size of the corpus). Many words that occur in the corpus convey little or no information about the content of the documents in which they occur. This can lead to irrelevant results, especially for poorly chosen search terms.
nops( Search( [ "the" ] ) );
4
This problem can be solved by removing unimportant terms from the index. This set of unwanted terms is called a stop list.
STOP_WORDS := { "a", "i", "an", "the", "in", "to", "which", "that", "is", "and", "we", "it", "of", "all", "can", "does", "don't", "most", "true", "thing" }:
A second problem is synonymy in the corpus. That is, many distinct terms in the corpus have the same meaning in the context of searching. For example, searching for the term "mathematics" should return documents that contain the terms "mathematical", "math" and "mathematically", even if the exact term "mathematics" does not occur in the document. To solve this problem, map each term onto its stem, or root term.
To implement these changes in the DocumentTerms procedure, use the StringTools package.
DocumentTerms := proc( text::string ) global STOP_WORDS; local words; use StringTools in words := map( LowerCase, Words( text ) ); words := remove( type, words, STOP_WORDS ); map( Stem, words ); end use; end proc:
By using the LowerCase function, case distinctions are removed, making the process more efficient. Apply the same preprocessing to search terms: convert to lowercase, remove stop list words, and map words onto their stems.
Search := proc( query::list(string) ) global Index; local results, term, terms; use StringTools in terms := map( LowerCase, query ); terms := remove( type, terms, STOP_WORDS ); terms := map( Stem, terms ); end use; results := {}; for term in terms do if assigned( Index[ term ] ) then results := results union Index[ term ]; end if; end do; results; end proc:
Because BuildIndex uses DocumentTerms, rebuild the index.
83
This has substantially reduced the size of the index.
Search( [ "the" ] );
∅
Because the stop word "the" has been removed from the index, it returns no matches. This is also a user interface enhancement because queries can contain stop words without affecting the results.
I don't believe in mathematics.,God does not care about our mathematical difficulties. He integrates empirically.,The most beautiful thing we can experience is the mysterious. It is the source of all true art and science.,Anyone who cannot cope with mathematics is not fully human. At best he is a tolerable subhuman who has learned to wear shoes, bathe, and not make messes in the house.,The mathematician's patterns, like the painter's or the poet's must be beautiful; the ideas, like the colors or the words must fit together in a harmonious way. Beauty is the first test: there is no permanent place in this world for ugly mathematics.
nops( (8) );
5
The new process returns many more documents relevant to the query.
The Vector Space Model
It is possible to model the search by using a vector space model of the term-document indices for corpora. After collecting the relevant search terms for a corpus, represent each document by a vector in a Euclidean space En, where n is the number of distinct terms in the corpus. Each coordinate in this vector space represents a term in the corpus. First determine an ordering of the corpus terms. Then map each document to a term vector in this Euclidean space. A simple mapping is to represent a document by a term vector whose ith coordinate is equal to 1 if the ith term occurs in the document, and is equal to 0 otherwise. For each query, construct a term vector. The dot product of the query term vector with a document term vector is the number of query terms that appear in the document. A document containing more query terms might reasonably be judged more relevant than one containing fewer query terms. To rank the documents, sort the dot products of the query term vector with the document term vectors.
For example, consider a smaller corpus comprising only the documents doc2, doc5, and doc7.
SmallIndex := BuildIndex( [ doc2, doc5, doc7 ] ):
The ordered list of corpus terms is:
SmallCorpusTerms := sort( [op]( map( op, {indices}( SmallIndex ) ) ) );
SmallCorpusTerms≔arithmet,art,beauti,believ,experi,god,mathemat,mysteri,scienc,sourc
Note: Terms in the documents are replaced by their stems and stop words are removed from the index. The document term vector can be computed using the following procedure.
DocumentTermVector := proc( doc ) global SmallCorpusTerms; local terms; terms := DocumentTerms( doc ); Vector[row]( 1 .. nops( SmallCorpusTerms ), i -> `if`( member( SmallCorpusTerms[ i ], terms ), 1, 0 ) ); end proc:
doc5Vector := DocumentTermVector( doc5 );
doc5Vector≔0001001000
Compute the query term vector for the search term "mathematical beauty".
queryVector := DocumentTermVector( "mathematical beauty" );
queryVector≔0010001000
The inner product is:
queryVector . doc5Vector;
which indicates that one of the query terms (corresponding to "mathematical") appears in the document.
To rank the documents in the corpus for their relevance to the query, apply this process to each document in the corpus.
queryVector . DocumentTermVector( doc2 ); queryVector . DocumentTermVector( doc5 ); queryVector . DocumentTermVector( doc7 );
0
It is more efficient to represent the entire corpus by a term-document matrix, in which the rows are the term vectors for the documents in the corpus. First determine a fixed ordering of the documents in the corpus. The dot products can then be computed by forming the product of the term-document matrix representing the corpus with the term vector for a query.
TermDocumentMatrix := Matrix( map( [s->s], map( DocumentTermVector, [ doc2, doc5, doc7 ] ) ) );
TermDocumentMatrix≔100001000000010010000110100111
Scores := TermDocumentMatrix . queryVector^+;
Scores≔011
There is a geometric interpretation of the search process. The dot product of two vectors, u and v, is related to the angle between the vectors, for which the cosine is given by the formula:
u,v||u||·||v||
Consider the projection of the document term vectors onto the hyperplane defined by the component vectors of the query vector, and then introduce appropriate scaling, using vector norms. In this context, searching can be viewed as a process of determining those vectors representing documents in a corpus for which the angle between their projections and the query vector is small.
Term Weighting
A document that contains many instances of a given term is generally considered to be more relevant to queries containing that term than a document with fewer instances. To improve rankings, therefore, record not only the presence of a term in a document, but also its frequency. This is a simple change to DocumentTermMatrix. The term vector of a document is the vector of En whose ith entry is equal to the number of times the ith corpus term occurs in the document.
DocumentTermVector := proc( doc ) global SmallCorpusTerms; local terms; terms := DocumentTerms( doc ); Vector[row]( 1 .. nops( SmallCorpusTerms ), i -> numboccur( terms, SmallCorpusTerms[ i ] ) ); end proc:
This can lead to significantly improved results when searching a larger corpus.
To improve this method, scale the number of instances by the size (the total number of terms with multiplicities) of the document. For example, a book about cats is not more relevant than a short paper on cats merely because the term "cats" appears more often in the book than in the short article.
DocumentTermVector := proc( doc ) global SmallCorpusTerms; local terms; terms := DocumentTerms( doc ); Vector[row]( 1 .. nops( SmallCorpusTerms ), i -> evalf( numboccur( terms, SmallCorpusTerms[ i ] ) / nops( terms ) ) ); end proc:
With this change, recompute the term-document matrix and the matrix of scores, which represents the search results. Also recompute the query term vector.
TermDocumentMatrix := Matrix( map( [s->s], map( DocumentTermVector, [ doc2, doc5, doc7 ] ) ) ):
queryVector := DocumentTermVector( "mathematical beauty" ):
Scores≔0.0.2500000000000000.0833333333500000
According to these results, the second document in the corpus (doc5) is the most relevant to the query, followed by the third document (doc7). The first document (doc2) is judged to have no relevance to the query.
Building a Search Engine Package
The next step is to design a search engine package that includes all the features described to this point. The search engine must also be as generic as possible, and accept a variety of document and term types.
The package manages two kinds of data objects: documents and corpora. Each is represented by an object that supports certain features.
A document is abstracted as a record object with three slots. The constructor is:
Document := proc( id, fetch, filter ) Record( ':-id' = id, ':-fetch' = fetch, ':-filter' = filter ); end proc:
The id slot is the document ID. The document ID must uniquely represent the document within the corpus. For efficiency, document IDs are small, while their referents may be quite large. While enormous documents can be handled in the design, for simplicity, it is assumed that any document can reasonably be read into memory in its entirety. The fetch slot is a procedure that returns the body of a document, given its document ID. The filter slot contains a procedure that generates a list of terms appearing in the document, with multiplicities. Several accessor procedures for documents are also provided in the package.
A corpus is a collection of documents. It is represented by an object that supports methods for accessing, indexing, and searching the collection.
SearchEngine := module() description "a simple search engine"; option package; export Filters, # subpackage of filters Document, # document constructor DocumentID, # accessor procedure FilterDocument, # accessor procedure FetchDocument, # accessor procedure Corpus, # corpus constructor NumberOfDocuments, # returns number of documents in corpus BuildIndex, # index a corpus GetDocumentIdByIndex, # retrieve a document ID Search; # perform a search local documentTermVector, # construct document term vector queryTermVector; # construct query term vector # Exports # Document constructor Document := proc( id, fetch, filter ) Record( ':-id' = id, ':-fetch' = fetch, ':-filter' = filter ); end proc; # Accessor routines for documents. DocumentID := doc -> doc:-id; FetchDocument := doc -> doc:-fetch( doc:-id ); FilterDocument := doc -> doc:-filter( FetchDocument( doc ) ); # Corpus constructor. Called with either a sequence of # documents or a list of document IDs, a fetcher, a document # filter routine, and an optional query filter routine. Corpus := proc( listOfIds::list, fetch, filter, _qfilter ) local docs, qfilter; # Process arguments. if nargs = 0 then error "expecting corpus description" elif nargs > 3 then # Allow the query filter to be different # from the document filter qfilter := eval( _qfilter, 2 ); else # If query filter is not specified, # use the document filter. qfilter := eval( filter, 2 ); end if; # Construct list of documents. docs := map( Document, listOfIds, fetch, filter ); # Build the corpus. module() export search, buildIndex, numberOfDocuments, getDocumentIdByIndex; local ids, corpusTerms, documents, term_document_matrix; ids := listOfIds; documents := docs; numberOfDocuments := () -> nops( documents ); getDocumentIdByIndex := proc( idx::posint ) if idx <= numberOfDocuments() then ids[ idx ]; else error "there are fewer than %1 documents in the corpus", idx; end if; end proc; buildIndex := proc() local docterms; # Construct corpus terms. docterms := map( FilterDocument, documents ); corpusTerms := sort( [op]( `union`( op( map( {op}, docterms ) ) ) ) ); # Build the term-document matrix. term_document_matrix := Matrix( map( [s -> s], map( documentTermVector, docs, corpusTerms ) ), 'datatype' = 'float'[ 8 ], 'storage' = 'sparse' ); eval( thismodule, 1 ); end proc; search := proc( query, numberOfResults::posint ) local qt, qv, scores; if not type( term_document_matrix, 'Matrix' ) then error "corpus not yet indexed"; end if; qt := qfilter( query ); qv := queryTermVector( qt, corpusTerms ); scores := (term_document_matrix . qv)^+; end proc; end module; end proc; NumberOfDocuments := corpus -> corpus:-numberOfDocuments(); GetDocumentIdByIndex := ( corpus, idx ) -> corpus:-getDocumentIdByIndex( idx ); BuildIndex := corpus -> corpus:-buildIndex(); Search := ( corpus, query ) -> corpus:-search( query ); # Locals documentTermVector := proc( doc, corpusTerms::list ) local terms; terms := FilterDocument( doc ); Vector[row]( 1 .. nops( corpusTerms ), i -> evalf( numboccur( terms, corpusTerms[ i ] ) / nops( terms ) ), 'datatype' = 'float'[ 8 ], 'storage' = 'sparse' ); end proc; queryTermVector := proc( queryTerms::list, corpusTerms::list ) Vector[column]( 1 .. nops( corpusTerms ), i -> evalf( numboccur( queryTerms, corpusTerms[ i ] ) / nops( queryTerms ) ), 'datatype' = 'float'[ 8 ], 'storage' = 'sparse' ); end proc; # Filters subpackage Filters := module() description "filter subpackage"; option package; export Text; local stopWords; stopWords := { "a", "i", "an", "the", "in", "to", "which", "that", "is", "and", "we", "it", "of", "all", "can", "does", "don't", "most", "true", "thing" }: Text := proc( text::string ) local words; use StringTools in words := map( LowerCase, Words( text ) ); words := remove( type, words, stopWords ); map( Stem, words ); end use; end proc; end module; end module:
with( SearchEngine ):
corpus := Corpus( [ doc || ($1..9) ], s -> s, Filters:-Text ):
NumberOfDocuments( corpus );
9
GetDocumentIdByIndex( corpus, 1 );
The mathematician's patterns, like the painter's or the poet's must be beautiful; the ideas, like the colors or the words must fit together in a harmonious way. Beauty is the first test: there is no permanent place in this world for ugly mathematics.
BuildIndex( corpus ):
Search( corpus, "mathematical beauty" );
0.04838709677500000.0.02083333333500000.0.2500000000000000.0.08333333335000000.0.0500000000000000
Latent Semantic Analysis
The simple vector space model described previously has shortcomings. Suppose a user searches a corpus of documents about pets for the term "cat". The simple vector space model search engine locates all the documents that contain the term "cat". However, the search engine does not locate documents that contain the word "feline", but not the word "cat" (or any term for which the stem is "cat"). This issue is called synonymy -- the problem that distinct search terms may represent the same concept. One way to circumvent this problem is to have a human domain expert prepare search term lists for each document in the corpus. All documents referring either to "cats" or "felines" would contain the search term "cat" included in their list of search terms. This solution is, however, very expensive.
An automatic indexing procedure known as latent semantic analysis (LSA) helps to discover relations between terms and documents that are latent in the corpus by analyzing the corpus as a whole. The process is related to factor analysis, and is essentially a statistical technique, relying heavily on linear algebra. When used to prepare a search index for a corpus, LSA is referred to as latent semantic indexing (LSI).
A thorough discussion of LSI is beyond the scope of this manual, so only the operational details necessary to construct the SearchEngine package are discussed.
LSI computes a lower rank approximation to the term-document matrix. The approximation is computed by using the singular value decomposition of the term-document matrix. The singular value decomposition of a matrix A is a factorization:
A=U·S·VT
where U and V are column orthogonal matrices, and S is a diagonal matrix whose diagonal entries are arranged in descending order. The diagonal entries of S are called the singular values of A, the columns of U are called the left singular vectors of A, and the columns of V are called the right singular vectors of A. If A is a term-document matrix, then the columns of U can be interpreted as an orthogonal basis for the semantic factors of term-term correlations between terms in the corpus represented by A, while the columns of V can be interpreted as an orthogonal basis for the semantic factors of the correlations between documents. For large corpora, the rank of S may be on the order of a few thousand. To obtain a rank k approximation of A that is closest in the least squares sense, choose a rank k smaller than the rank of S (say, on the order of a few hundred), and form the matrix:
Ak=Uk·Sk·VkT
where Uk consists of the first k columns of U, Vk consists of the first k columns of V, and Sk is the first k×k submatrix of S.
When the matrix A is a term-document matrix, its approximation Ak is used as a surrogate corpus index for searching. It can be argued that the matrix Ak is better able to determine correlations between terms in such a way that searches using it are able to approximate results obtained by human expert indexing. For example, in a corpus on pets, some documents may contain the term "cat", others the term "feline", and still others may contain both terms. LSI places documents containing only one of these terms closer together in the lower dimensional projection by virtue of their co-occurrence in some of the documents in the corpus. In practice, the rank k chosen for the approximation is an empirically determined value, based on the quality of search results. Because LSI is a statistical technique, in general, it produces useful results only for large corpora.
The Search Engine Package
This section modifies the SearchEngine package to incorporate LSI without changing the interface that was designed for the simpler indexing scheme. The updated package contains more filters to allow for a greater variety of corpora.
SearchEngine := module() description "a generic search engine package"; option package; export Filters, # subpackage of filters Document, # document constructor DocumentID, FilterDocument, FetchDocument, Corpus, # corpus constructor NumberOfDocuments, BuildIndex, GetDocumentIdByIndex, Search; local Tools, # private submodule documentTermVector, queryTermVector; # Exports # Document constructor Document := proc( id, fetch, filter ) description "document constructor"; Record( ':-id' = id, ':-fetch' = fetch, ':-filter' = filter ); end proc; # Document accessors. DocumentID := doc -> doc:-id; FetchDocument := doc -> doc:-fetch( doc:-id ); FilterDocument := doc -> doc:-filter( FetchDocument( doc ) ); # Corpus constructor. Called with either a sequence of documents, # or a list of document IDs, a fetcher and a document filter # routine, and a query filter routine. Corpus := proc( listOfIds::list, fetch, filter, _qfilter ) description "corpus constructor"; local docs, qfilter; # Process arguments. if nargs = 0 then error "expecting corpus description"; elif nargs > 3 then # Allow the query filter to be different # than the document filter qfilter := eval( _qfilter, 2 ); else # If not query filter is specified, # use the document filter qfilter := eval( filter, 2 ); end if; # Construct list of documents. docs := map( Document, listOfIds, fetch, filter ); # Build the corpus. module() export search, buildIndex, numberOfDocuments, getDocumentIdByIndex; local ids, corpusTerms, documents, term_document_matrix; # Initialize private data. ids := listOfIds; documents := docs; # Accessor methods. numberOfDocuments := () -> nops( docs ); getDocumentIdByIndex := proc( idx::posint ) if idx <= numberOfDocuments() then ids[ idx ]; else error "there are fewer than %1 documents in the corpus", idx; end if; end proc; # Construct an index based on a _k-dimensional approximation # to the term-document matrix. buildIndex := proc( _k::posint ) local docterms, k, u, s, v; # Construct corpus terms. docterms := map( FilterDocument, documents ); corpusTerms := sort( [op]( `union`( op( map( {op}, docterms ) ) ) ) ); # Build the term-document matrix. term_document_matrix := Matrix( map( [s -> s], map( documentTermVector, docs, corpusTerms ) ), 'datatype' = 'float'[ 8 ], 'storage' = 'sparse' ); use LinearAlgebra in u, s, v := SingularValues( term_document_matrix, 'output' = [ ':-U', ':-S', ':-Vt' ] ); v := v^+; if nargs > 0 then k := _; else # Use a default if no dimension provided. k := floor( Rank( DiagonalMatrix( s ) ) * 0.7 ); end if; u := u[ 1 .. -1, 1 .. k ]; v := v[ 1 .. k, 1 .. -1 ]; s := DiagonalMatrix( s[ 1 .. k ] ); # Replace the term-document matrix with its rank # k approximation term_document_matrix := MatrixMatrixMultiply( u, MatrixMatrixMultiply( s, v ) ); end use; eval( thismodule, 1 ); end proc; search := proc( query, numberOfResults::posint ) local qt, qv, scores; if not type( term_document_matrix, 'Matrix' ) then error "corpus not yet indexed"; end if; qt := qfilter( query ); qv := queryTermVector( qt, corpusTerms ); scores := (term_document_matrix . qv)^+; Tools:-permSort( scores ); end proc; end module; end proc; NumberOfDocuments := corpus -> corpus:-numberOfDocuments(); GetDocumentIdByIndex := ( corpus, idx ) -> corpus:-getDocumentIdByIndex( idx ); BuildIndex := ( corpus, k ) -> `if`( nargs = 1, corpus:-buildIndex(), corpus:-buildIndex( k ) ); Search := ( corpus, query ) -> corpus:-search( query ); # Locals documentTermVector := proc( doc, corpusTerms::list ) local terms, norm; terms := FilterDocument( doc ); Vector[row]( 1 .. nops( corpusTerms ), i -> evalf( numboccur( terms, corpusTerms[ i ] ) / nops( corpusTerms ) ), 'datatype' = 'float'[ 8 ], 'storage' = 'sparse' ); end proc; queryTermVector := proc( queryTerms::list, corpusTerms::list ) Vector[column]( 1 .. nops( corpusTerms ), i -> evalf( numboccur( queryTerms, corpusTerms[ i ] ) / nops( corpusTerms ) ), 'datatype' = 'float'[ 8 ], 'storage' = 'sparse' ); end proc; # The Tools submodule Tools := module() export permSort; permSort := proc( V::Vector ) local partition, quickSort, n, P, i; partition := proc( a, lb, ub ) local i, j, k, v; i, j, k := lb, 1 + ub, lb; v := a[ k ]; while i < j do i := 1 + i; while i < j and a[ i ] < v do i := 1 + i; end do; j := j - 1; while a[ j ] > v do j := j - 1; end do; if i < j then P[ i ], P[ j ] := P[ j ], P[ i ]; a[ i ], a[ j ] := a[ j ], a[ i ]; end if; end do; P[ k ], P[ j ] := P[ j ], P[ k ]; a[ k ], a[ j ] := a[ j ], a[ k ]; j; end proc; quickSort := proc( a, lb, ub ) local k; if lb < ub then k := partition( a, lb, ub ); procname( a, lb, k - 1 ); procname( a, k + 1, ub ); end if; a; end proc; n := LinearAlgebra:-Dimensions( V ); P := Array( 1 .. n, [ $ 1 .. n ], 'datatype' = 'integer[ 4 ]' ); quickSort( V, 1, n ); [seq]( P[ i ], i = 1 .. n ); end proc; end module; # The Filters subpackage Filters := module() option package; export Text, XML, Maple, WorksheetTerms; local stopWords; stopWords := { # The 48 most common English words "i", "a", "all", "an", "and", "are", "as", "at", "be", "been", "but", "by", "can", "do", "for", "from", "had", "has", "have", "he", "his", "if", "in", "is", "it", "not", "of", "on", "or", "she", "that", "the", "their", "there", "they", "this", "to", "was", "we", "were", "what", "which", "who", "will", "with", "would", "you", # a few others "thing", "true", "most", "does", "don't", NULL}; Text := proc( text::string ) description "compute the terms in a text string"; local words; use StringTools in words := map( LowerCase, Words( text ) ); words := remove( type, words, stopWords ); map( Stem, words ); end use; end proc; XML := proc( xml ) description "compute the terms in an XML document"; local t, count, rec; rec := proc( xml, t::table ) local cm, texts, text, others; use XMLTools in if IsElement( xml ) then cm := ContentModel( xml ); texts, others := selectremove( IsText, cm ); for text in texts do count := 1 + count; t[ count ] := text; end do; map( procname, others, t ); end if; end use; end proc; count := 0; t := rec( xml, t ); [seq]( t[ i ], i = 1 .. count ); end proc; Maple := proc( expr ) description "compute the terms in a Maple expression"; local fns, terms, nocc; fns := [op]( map2( op, 0, indets( expr, 'function' ) ) ); nocc := map2( numboccur, expr, fns ); terms := [seq]( [ seq( fns[ i ], j = 1 .. nocc[ i ] ) ], i = 1 .. nops( fns ) ); sort( map( op, terms ), 'lexorder' ); end proc; WorksheetTerms := proc( mws ) description "compute the terms in a worksheet"; local rec, wks, count, t, i; rec := proc( x, t::table ) local cm, texts, text, others; use XMLTools in if IsElement( x ) and ElementName( x ) = "Text-field" then cm := ContentModel( x ); texts, others := selectremove( IsText, cm ); for text in texts do count := 1 + count; t[ count ] := text; end do; map( procname, others, t ); elif not type( x, 'string' ) then map( procname, args ); end if; end use; end proc; use XMLTools in if IsDocument( mws ) then wks := select( IsElement, [op]( mws ) ); if nops( wks ) = 1 then wks := wks[ 1 ]; else error "ill-formed worksheet `%1'", fname; end if; end if; end use; count := 0; t := table(); rec( wks, t ); t := [seq]( t[ i ], i = 1 .. count ); use XMLTools in t := map( TextText, t ); t := cat( op( t ) ); Filters:-Text( t ); end use; end proc; end module; end module:
The revised package contains several new document filters. To use document formats that are not directly supported, compose these filters with custom code. Rather than providing a vector of raw scores, the Search command in the package now returns a permutation of the document list indicating document rankings. This can be used directly with the GetDocumentIdByIndex routine.
Using the Package
The package can be used with a variety of corpora. This subsection demonstrates two examples. The first is the corpus of short text strings used previously in this section.
with( SearchEngine ): corpus := Corpus( [ doc || ($1..9) ], id -> id, Filters:-Text ):
GetDocumentIdByIndex( corpus, 1 ); ranking := Search( BuildIndex( corpus ), "mathematical beauty" );
ranking≔1,3,7,5,6,2,9,8,4
map2( GetDocumentIdByIndex, corpus, ranking[ 1 .. 3 ] );
The mathematician's patterns, like the painter's or the poet's must be beautiful; the ideas, like the colors or the words must fit together in a harmonious way. Beauty is the first test: there is no permanent place in this world for ugly mathematics.,Anyone who cannot cope with mathematics is not fully human. At best he is a tolerable subhuman who has learned to wear shoes, bathe, and not make messes in the house.,The most beautiful thing we can experience is the mysterious. It is the source of all true art and science.
The second example corpus is a database of formulae. The intent is to be able to locate formulae relevant to a search query consisting of function names. For the formula database, generate a list of identities among elementary functions using the FunctionAdvisor command.
Formulae := map2( FunctionAdvisor, 'identities', FunctionAdvisor( 'elementary', 'quiet' ), 'quiet' ):
Formulae := map( op, sort( Formulae, 'length' ) ):
nops( Formulae );
138
Use each formula as both the document and its ID. The Maple filter in the Filters subpackage extracts the terms from each document.
corpus2 := Corpus( Formulae, id -> id, Filters:-Maple, query -> [op]( {op}( query ) ) ):
BuildIndex( corpus2 ):
It is possible to locate formulae relevant to, for example, the sin and cos functions.
ranking := Search( corpus2, [ 'sin', 'cos' ] );
ranking≔88,85,87,86,83,91,95,118,119,94,44,113,74,90,38,93,117,21,84,17,57,99,112,122,100,14,4,92,98,97,121,114,116,115,41,73,42,70,77,102,29,110,101,43,8,23,72,15,24,16,1,20,5,7,11,13,18,3,53,69,134,34,89,78,111,40,9,22,12,59,60,76,71,127,108,6,133,25,48,64,30,132,106,126,37,35,55,46,31,36,56,45,75,103,105,104,32,33,26,52,62,51,67,128,107,54,96,27,123,28,68,61,47,120,39,49,50,125,131,130,63,58,124,65,2,66,109,80,136,129,10,135,137,138,79,82,81,19
map2( GetDocumentIdByIndex, corpus2, ranking[ 1 .. 3 ] );
ln⁡za⁢yb=a⁢ln⁡z+b⁢ln⁡y+2⁢I⁢π⁢π−ℑ⁡a⁢ln⁡z−ℑ⁡b⁢ln⁡y2⁢π,ln⁡z+y=ln⁡y+ln⁡12⁢y+z+ln⁡2⁢y+z+2⁢arctanh⁡z2⁢y+z,0<y,ln⁡zy=y⁢ln⁡z+2⁢I⁢π⁢π−ℑ⁡y⁢ln⁡z2⁢π
Construct a similar corpus using a different choice for the document IDs and the fetcher routine passed to the constructor. Instead of using formulae for both the document and its ID, use the position of the formula in the global list Formulae as the document ID, and pass a suitable fetcher routine.
corpus3 := Corpus( [$1..nops(Formulae)], id -> Formulae[ id ], Filters:-Maple, query -> [op]( {op}( query ) ) ):
The common and practical case, in which a corpus represents a collection of files to be indexed, can be handled by using a constructor call such as the following.
corpus := Corpus( remove( type, listdir( "MyDocuments" ), { ".", ".." } ), fname -> readbytes( fname, 'TEXT', infinity ), Filters:-Text ):
If the documents contain structured text encoded as XML, then a similar invocation can be used.
corpus := Corpus( remove( type, listdir( "MyDocuments" ), { ".", ".." } ), fname -> XMLTools:-ParseFile( fname ), Filters:-XML ):
Finally, a directory of Maple worksheets can be represented by a corpus constructed as follows.
corpus := Corpus( remove( type, listdir( "MyDocuments" ), { ".", ".." } ), fname -> Worksheet:-ReadFile( fname ), Filters:-WorksheetTerms ):
A client of the SearchEngine package can provide a specialized filter routine to be used in constructing a corpus object to represent a collection of documents of a specific type. Generic interfaces and careful hiding of representational details provide considerable client flexibility and the ability to evolve the implementation.
Return to the Index of Example Worksheets.
See Also
StringTools
Download Help Document