Search Guide

Creating indexes for the search engine is much simpler than creating them for your database. The query planner uses every available index to process a query by performing set operations on the skip list indexes. For the best performance, create an index on every field on which you intend to perform exact value matches, range scans, or sort operations. The search engine cannot perform a text search unless a text index has been created.
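As a rough illustration of the set-operation idea, the sketch below intersects two sorted lists of document ids, the way a planner could combine the results of two single-field index lookups. The function name and the sample ids are hypothetical, and this is not the engine's actual implementation, which works on skip list indexes.

```javascript
// Hypothetical sketch: intersect two ascending lists of document ids.
// Each list stands for the ids returned by one index lookup.
function intersectSorted(a, b) {
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(a[i]);
      i++;
      j++;
    } else if (a[i] < b[j]) {
      i++;
    } else {
      j++;
    }
  }
  return out;
}

// Made-up ids: one list from an index on age, one from an index on height.
const byAge = [2, 5, 8, 13, 21];
const byHeight = [1, 5, 9, 13, 34];
console.log(intersectSorted(byAge, byHeight)); // [ 5, 13 ]
```

Because both inputs are sorted, the intersection needs only a single pass over each list, which is one reason an index on every queried field pays off.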

The search engine uses an index to process $exist queries. You do not have to create this index explicitly, because the search engine automatically creates an index of all attribute names in the document.


This feature is only available from your Nodechef Cloud Search instance. None of the examples provided below will work on a MongoDB database. To deploy a cloud search instance, log in and navigate to the dashboard. From deployments > deploy cloud search, you can then spin up a cloud search server.

Creating non text indexes

Indexes are typically created from the dashboard. If you prefer to script everything, you can also create them from the Mongo shell or through the driver.

    db.people.createIndex( { age : "INT32" } );
    db.people.createIndex( { age : "INT32", height : "DOUBLE" } );
    db.comments.createIndex( { post_id : "BSONID" } )
    db.comments.createIndex( { _created_at : "DATETIME" } )

    // Create a unique index on an email field.
    db.comments.createIndex( { email : "STRING" }, { unique : true } )

    // Use 1 and -1 to create MongoDB-style indexes. Only useful when indexing arrays and an entire child document.
    db.products.createIndex( { shipping_address : 1 } )

Index Data Types

  • BSONID
  • BOOLEAN
  • DOUBLE
  • DECIMAL
  • DATETIME
  • INT32
  • INT64
  • FLOAT
  • 2DSPHERE
  • STRING

Remarks

Creates a single-dimensional typed index. The corresponding value in the document is cast to the specified type at index time, so you are not required to manage types in the document itself.

The 2DSPHERE data type is used to index GeoJSON shapes.

We strongly encourage you to use typed indexes such as INT32 and INT64 for fields whose type is known at document insertion time. Typed indexes are space efficient and significantly improve search performance.



Creating text indexes

Cloud search uses Lucene stop words. The following languages are supported: SV, ES, RU, PT, NO, IT, HU, DE, FR, FI, EN, NL, DA, TR, TH, RO, LV, ID, HY, HI, GL, GA, FA, EU, EL, CZ, KU, CA, BR, BG, AR

Cloud search uses a radix tree, which allows you to efficiently search terms without having to stem.


    // Create a text index using English stop words.
    db.records.createIndex( { tss : "text lang('en')" } )

    // Create an index which does not exclude stop words from the index.
    db.records.createIndex( { tss : "text include_stopwords lang('en')" } )

    // Create a text index to index all string fields in the document.
    db.records.createIndex( { "$**" : "text lang('en')" } )

    // Set the position_offset_gap to ensure phrase and proximity searches do not span multiple attributes.
    db.records.createIndex( { "$**" : "text lang('en') position_offset_gap(100)" } )

    // Create a text index to index a sub document.
    // Consider the document
    {
        "firstName": "John",
        "lastName": "Smith",
        "isAlive": true,
        "age": 25,
        "address": {
            "streetAddress": "21 2nd Street",
            "city": "New York",
            "state": "NY",
            "postalCode": "10021-3100"
        }
    }

    db.records.createIndex( { address : "text lang('en') position_offset_gap(100)" } )
    // All string elements in the address sub document will be included in the text index.

All of the above examples create a text index that stores the position of each term. This can use a lot of memory, and in many applications an exact phrase search is not required. For most applications, an OR query together with the ranking provided should be good enough to provide accurate results. Unless you intend to perform exact phrase matches or proximity searches, we advise you to use the without_phrase clause to save memory.

    db.records.createIndex( { address : "text lang('en') without_phrase" } )
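If you generate index definitions from code, a small helper can assemble the option string from the clauses shown above. This is only a sketch: textIndexSpec and its option names are hypothetical, and it assumes the clauses can be combined in this order. Only the clause spellings themselves come from the examples in this section.

```javascript
// Hypothetical helper: assemble the option string that is passed as the
// field value in createIndex, using the clauses from the examples above.
function textIndexSpec(opts) {
  const parts = ["text"];
  if (opts.includeStopwords) parts.push("include_stopwords");
  if (opts.lang) parts.push("lang('" + opts.lang + "')");
  if (opts.positionOffsetGap !== undefined) {
    parts.push("position_offset_gap(" + opts.positionOffsetGap + ")");
  }
  if (opts.withoutPhrase) parts.push("without_phrase");
  return parts.join(" ");
}

console.log(textIndexSpec({ lang: "en", withoutPhrase: true }));
// text lang('en') without_phrase
console.log(textIndexSpec({ lang: "en", positionOffsetGap: 100 }));
// text lang('en') position_offset_gap(100)
```

The returned string can then be used as the value in a createIndex call, for example db.records.createIndex({ address : textIndexSpec({ lang : "en", withoutPhrase : true }) }).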

Arrays and Offset gap


    // Issue: A curious thing can happen when you try to use phrase matching on
    // multivalue fields. Imagine that you index this document:
    { "names": [ "John Abraham", "Lincoln Smith"] }

    // Then run a phrase query for Abraham Lincoln:
    db.records.find( { $text : "\"Abraham Lincoln\"" } )

    // Surprisingly, our document matches, even though Abraham and Lincoln belong
    // to two different people in the names array.
    // Use the position_offset_gap workaround at index time to prevent this issue.
    db.records.createIndex( { names : "text position_offset_gap(100) stem lang('en')" } )

    // See the Elasticsearch documentation for more details.
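To see why the gap helps, the sketch below simulates how term positions could be assigned across array elements. It assumes that the position counter simply jumps by the gap after each element and that a phrase matches only when its terms sit at consecutive positions. Both function names are hypothetical, and the real index layout may differ.

```javascript
// Assign a position to every term of a multi-value field. After each
// array element the position counter jumps by `gap` (assumed semantics).
function indexPositions(values, gap) {
  const positions = [];
  let pos = 0;
  for (const value of values) {
    for (const term of value.toLowerCase().split(/\s+/)) {
      positions.push({ term: term, pos: pos });
      pos++;
    }
    pos += gap;
  }
  return positions;
}

// A phrase matches when its terms sit at consecutive positions.
function matchesPhrase(positions, phrase) {
  const terms = phrase.toLowerCase().split(/\s+/);
  return positions.some(function (start) {
    return terms.every(function (term, k) {
      return positions.some(function (p) {
        return p.term === term && p.pos === start.pos + k;
      });
    });
  });
}

const names = ["John Abraham", "Lincoln Smith"];
console.log(matchesPhrase(indexPositions(names, 0), "Abraham Lincoln"));   // true
console.log(matchesPhrase(indexPositions(names, 100), "Abraham Lincoln")); // false
console.log(matchesPhrase(indexPositions(names, 100), "John Abraham"));    // true
```

With a gap of 100, "Abraham" stays at position 1 while "Lincoln" moves to position 102, so the two are no longer adjacent. Phrases inside a single array element are unaffected.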


Search Syntax

Cloud search textx can be used as an operator in a find query or as a database command.

Search syntax in a MongoDB find query

    db.collection.find({
        "$textx" : {
            PHRASE : "eat, drink and make merry",
            FUZZY : 1,
            GAP : -2
        },
        price : { $lt : 39 }
    }, {}).limit(20)

Search syntax using runCommand

    db.runCommand({
        textx : <collection_name>,
        query : { <textx_query> },
        filter : { <mongodb_predicate> },
        project : { <mongodb_projection> },
        limit : <int>,
        skip : <int>,
        attribute : <indexed_attribute>
    })

    // Note: Use the attribute to restrict the search to a single index. This is only useful in cases where multiple
    // indexes have been created on the collection. This behaviour is different from MongoDB, which allows
    // only a single text index.

    // Example:
    db.runCommand({
        textx : "books",
        query : { AND : "complete guide node.js" },
        filter : { in_stock : 1, price : { $lt : 100 } },
        limit : 20
    })


Boolean operators - (Text Search)


    // Search for all documents that contain either Moon or Walk
    db.videos.find({ $textx : { OR : "Moon Walk" } })

    // Search for all documents that contain either Moon or Walk only in the title field.
    // This feature is only applicable where multiple text indexes have been created on the collection.
    db.videos.find({ title : { $textx : { OR : "Moon Walk" } } })

    // Search for all documents that contain neither Moon nor Walk
    db.videos.find({ $textx : { NOT : { OR : "Moon Walk" } } })

    // Search for all documents that contain Moon and Walk
    db.videos.find({ $textx : { AND : "Moon Walk" } })

    // Search for all documents that contain (Moon AND Walk) OR (Harlem AND Shake)
    db.videos.find({ $textx : { OR : [ { AND : "Moon Walk" }, { AND : "Harlem Shake" } ] } })

    // Search for all documents that contain neither (Moon AND Walk) nor (Harlem AND Shake)
    db.videos.find({ $textx : { NOT : [ { AND : "Moon Walk" }, { AND : "Harlem Shake" } ] } })

    // Search for all documents that contain at least two of the terms in "3d tv touch screen".
    // Documents which contain more terms in the set are ranked higher.
    db.electronics.find({ $textx : { ATLEAST : "3d tv touch screen", ratio : 0.5 } })

    // The ANY operator accepts an array of search expressions and executes them consecutively until
    // an expression returns a result.
    // Search for documents which contain the phrase "Moon Walk". If no matches are found, search
    // for documents that contain both "Moon" and "Walk" but not as a phrase.
    db.videos.find({ $textx : { ANY : [ { PHRASE : "Moon Walk" }, { AND : "Moon Walk" } ] } })
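The ATLEAST example relies on a ratio of 0.5 over four query terms meaning "at least two terms". The sketch below spells out that arithmetic, assuming the required count is the ratio times the number of query terms, rounded up. The rounding rule and the function name are assumptions, and the sketch compares whole terms only, ignoring prefix matching and ranking.

```javascript
// Sketch of the ATLEAST check, assuming the required number of matching
// terms is ceil(ratio * number of query terms).
function matchesAtLeast(text, query, ratio) {
  const docTerms = new Set(text.toLowerCase().split(/\s+/));
  const queryTerms = query.toLowerCase().split(/\s+/);
  const required = Math.ceil(ratio * queryTerms.length);
  const found = queryTerms.filter(function (t) {
    return docTerms.has(t);
  }).length;
  return { matched: found >= required, found: found, required: required };
}

console.log(matchesAtLeast("Samsung 55 inch 3d tv", "3d tv touch screen", 0.5));
// { matched: true, found: 2, required: 2 }
console.log(matchesAtLeast("touch lamp", "3d tv touch screen", 0.5));
// { matched: false, found: 1, required: 2 }
```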


Phrase & Proximity search


    // Search for all documents that contain the phrase "Past and present"
    db.books.find({ $textx : { PHRASE : "Past and present" } })

    // Search for all documents that contain the phrase "Past and present" and the term "Lecture"
    db.books.find({ $textx : { AND : [ { PHRASE : "Past and present" }, "Lecture" ] } })

    // Consider the below indexed document:
    { name : "John Aaron Smith" }

    // A phrase search using "John Smith" will not match any documents. However, using a GAP
    // of 2, "John Smith" will match "John Aaron Smith"
    db.users.find({ $textx : { PHRASE : "John Smith", GAP : 2 } })

    // What about the case when the search input is "Smith John"?
    // Using a negative GAP value will match "John Aaron Smith"
    db.users.find({ $textx : { PHRASE : "Smith John", GAP : -2 } })
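One way to picture the GAP behaviour is as a check on the distance between term positions. The sketch below handles two-term phrases only and encodes an assumed reading of the examples above: without a gap the terms must be adjacent and in order, a positive gap allows the second term up to that many positions later, and a negative gap additionally accepts the reversed order. The function name is hypothetical and the engine's exact rules may differ.

```javascript
// Sketch of a two-term proximity check over term positions.
// Assumed semantics: gap 0 requires adjacency in order, a positive gap
// allows the second term up to `gap` positions after the first, and a
// negative gap allows either order within |gap| positions.
function matchesWithGap(text, phrase, gap) {
  const terms = text.toLowerCase().split(/\s+/);
  const [first, second] = phrase.toLowerCase().split(/\s+/);
  const maxDistance = Math.max(1, Math.abs(gap));
  for (let i = 0; i < terms.length; i++) {
    if (terms[i] !== first) continue;
    for (let j = 0; j < terms.length; j++) {
      if (terms[j] !== second) continue;
      const delta = j - i;
      const inOrder = delta >= 1 && delta <= maxDistance;
      const reversed = gap < 0 && -delta >= 1 && -delta <= maxDistance;
      if (inOrder || reversed) return true;
    }
  }
  return false;
}

const name = "John Aaron Smith";
console.log(matchesWithGap(name, "John Smith", 0));  // false
console.log(matchesWithGap(name, "John Smith", 2));  // true
console.log(matchesWithGap(name, "Smith John", -2)); // true
console.log(matchesWithGap(name, "Smith John", 2));  // false
```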


Fuzzy Search

A fuzzy query matches terms that might be misspelled. Cloud search supports similarity matching based on Levenshtein edit distance.


Using Levenshtein edit distance

    // Syntax
    // Specify the number of misspelled characters to tolerate
    { fuzzy : "<total_misspelled_characters>" }

    // Examples:
    // Input search: "Oister"
    // Intent: "Oyster"
    db.recipes.find({ $textx : { OR : "Oister", FUZZY : 1 } })

    // Evaluate the Levenshtein edit distance (LED) algorithm using the below command. This is useful
    // to understand the behavior of the LED algorithm.
    db.runCommand({ textx : "recipes", fuzzy : { term : "oister", limit : 32, distance : 1 } })
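For reference, the Levenshtein edit distance between two terms is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. The sketch below is the textbook dynamic-programming version, handy for estimating which distance value a given misspelling needs. It is not the engine's own implementation.

```javascript
// Levenshtein edit distance using a single rolling row.
function levenshtein(a, b) {
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein("oister", "oyster"));  // 1
console.log(levenshtein("kitten", "sitting")); // 3
```

"oister" and "oyster" differ by one substitution, which is why a fuzzy value of 1 is enough for the example above.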

Spellchecking - Projecting the terms that were used to compute the search results

This feature is useful when the search engine must notify the end user that the input query contains possibly misspelled terms. A classic example is Google Search, which shows end users "did you mean" suggestions.

    db.recipes.find(
        { $textx : { OR : "Oister", FUZZY : 1 } },
        { SpellCheck : { $meta : "fuzzyMatches" } }
    ).limit(20)

    // The above query will output the below document
    {
        _id : ObjectId("507f191e810c823293de860ea"),
        SpellCheck : [ { given : "Oister", found : "Oyster" } ]
    }
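On the client side, the projected SpellCheck array can be turned into a "did you mean" string. The sketch below assumes each entry has the given and found fields shown in the sample output and that given matches a query term exactly as typed. The didYouMean function is a hypothetical helper, not part of Cloud search.

```javascript
// Build a "did you mean" suggestion from the SpellCheck array projected
// by the query. Returns null when no term was corrected.
function didYouMean(query, spellCheck) {
  const corrections = new Map();
  for (const entry of spellCheck) {
    corrections.set(entry.given, entry.found);
  }
  const corrected = query.split(" ").map(function (term) {
    return corrections.has(term) ? corrections.get(term) : term;
  }).join(" ");
  return corrected === query ? null : corrected;
}

const doc = { SpellCheck: [{ given: "Oister", found: "Oyster" }] };
console.log(didYouMean("Oister soup", doc.SpellCheck)); // Oyster soup
console.log(didYouMean("Oyster soup", doc.SpellCheck)); // null
```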


Prefix Search

By default, the search engine treats every term in the query as a prefix. The radix tree lets it store and query prefixes efficiently, so you do not have to issue a special query to perform a prefix search. Querying for the term "love" will also match any document containing the term "lovely".
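To give a feel for why a tree keyed on characters makes prefix lookups cheap, here is a minimal sketch using a plain trie, the uncompressed relative of a radix tree. All names are hypothetical and the engine's actual data structure is more compact. The point is only that finding every term under a prefix is a walk down the tree followed by a traversal of one subtree.

```javascript
// Minimal trie (an uncompressed relative of a radix tree) that returns
// every stored term starting with a given prefix.
function createNode() {
  return { children: {}, isTerm: false };
}

function insertTerm(root, term) {
  let node = root;
  for (const ch of term) {
    if (!node.children[ch]) node.children[ch] = createNode();
    node = node.children[ch];
  }
  node.isTerm = true;
}

function termsWithPrefix(root, prefix) {
  let node = root;
  for (const ch of prefix) {
    node = node.children[ch];
    if (!node) return [];
  }
  const results = [];
  function collect(current, term) {
    if (current.isTerm) results.push(term);
    for (const ch of Object.keys(current.children).sort()) {
      collect(current.children[ch], term + ch);
    }
  }
  collect(node, prefix);
  return results;
}

const trie = createNode();
["love", "lovely", "lover", "live"].forEach(function (t) {
  insertTerm(trie, t);
});
console.log(termsWithPrefix(trie, "love")); // [ 'love', 'lovely', 'lover' ]
```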



Highlighting & Snippetting

Highlight matching terms in the search field. Cloud search uses a postings-highlighter. The highlighter supports the following parameters:

tagBefore & tagAfter - The matching term is enclosed in the values supplied in these fields

terms_before_range & terms_after_range - The number of terms to include before the first and after the last matching term, respectively. Use these fields to show snippets of the document to the end user.

    db.recipes.find(
        { $textx : { OR : "Oister", FUZZY : 1 }, tags : "seafood", price : { $lt : 15 } },
        {
            snippet : {
                $meta : {
                    postingHighLighter : 1,
                    tagBefore : "<strong>",
                    tagAfter : "</strong>",
                    terms_before_range : 6,
                    terms_after_range : 6
                }
            }
        }
    )
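The effect of the four parameters can be pictured with a small client-side sketch. It assumes the window runs from terms_before_range terms before the first match to terms_after_range terms after the last match, and it compares whole terms only. The snippet function is hypothetical and only mimics the idea, it is not the postings-highlighter itself.

```javascript
// Sketch of snippet building: wrap matching terms in tags and keep only
// a window of terms around the first and last match.
function snippet(text, matchTerms, opts) {
  const terms = text.split(/\s+/);
  const wanted = new Set(matchTerms.map(function (t) {
    return t.toLowerCase();
  }));
  const hits = [];
  terms.forEach(function (term, i) {
    if (wanted.has(term.toLowerCase())) hits.push(i);
  });
  if (hits.length === 0) return "";
  const start = Math.max(0, hits[0] - opts.terms_before_range);
  const end = Math.min(
    terms.length,
    hits[hits.length - 1] + opts.terms_after_range + 1
  );
  return terms.slice(start, end).map(function (term, k) {
    return hits.includes(start + k)
      ? opts.tagBefore + term + opts.tagAfter
      : term;
  }).join(" ");
}

const text = "Grill each fresh oyster over hot coals until the shell opens";
console.log(snippet(text, ["oyster"], {
  tagBefore: "<strong>",
  tagAfter: "</strong>",
  terms_before_range: 2,
  terms_after_range: 2
}));
// each fresh <strong>oyster</strong> over hot
```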


Type-ahead suggestions

Type-ahead suggestions do not require a separate command. This section provides examples of query combinations that can be used to implement type-ahead suggestions.

    // Implementing type ahead suggestions for single terms. Eg: A dictionary application. Issue the below command.
    // The prefix operator is currently only supported for the trie fuzzy matching algorithm.
    db.runCommand({
        textx : "my_dictionary_collection",
        fuzzy : {
            term : "ambi",
            limit : 32,
            distance : 2, // Specify the edit distance
        }
    })

    // The above query will output a document with the below structure. The search engine treats all terms
    // as prefixes
    { values : [ "ambiguous", "ambitious" ], elapsedMs : 0, ok : 1 }

    // Implementing type ahead suggestions for multiple terms
    db.profiles.find( { Name : { $textx : { AND : "John Aa", FUZZY : 1 } } }, { Name : 1 } ).limit(8)

Remarks

The above query performs an "AND" boolean query on the Name attribute and selects the Name field to display in the client UI. You can fire this query to the search engine on every keystroke and display the list of results returned. Depending on your use case, you could also use the "PHRASE" operator. When searching fields that contain a large number of terms, such as a description field, you can use the postingHighlighter described in the previous section (Highlighting & Snippetting) to select only the matching part of the text.


Boosting

Static Boosting

Static boosting allows you to boost a specific index at query time. A typical scenario for this feature is tuning the search results to increase the relevance of terms that appear in a specific field.

    // Consider there is an index on the description and title fields of our collection named records.
    // The search engine to be implemented has a requirement to boost the relevance of documents that contain
    // matching terms of the end user's query in the title attribute.
    // The staticBoost feature in Cloud search allows us to easily accomplish this and fine tune the boost value
    // without having to reindex any documents.
    // Example:
    db.records.find({
        $or : [
            { title : { $textx : { OR : "complete guide node js", staticBoost : { value : 1.5, score : "*" } } } },
            { description : { $textx : { OR : "complete guide node js" } } }
        ]
    })

    // Possible values for staticBoost.score include:
    // + (Addition),
    // * (Multiplication)
    // Default value is * when omitted in the staticBoost document.
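The arithmetic behind the two score modes is simple enough to sketch. The function below assumes the boost value is combined with the base relevance score of a match exactly as the operator names suggest. The applyStaticBoost name and the sample scores are made up for illustration.

```javascript
// Apply a staticBoost document to a base relevance score.
// score "*" multiplies, "+" adds; "*" is the default when omitted.
function applyStaticBoost(baseScore, staticBoost) {
  const op = staticBoost.score === undefined ? "*" : staticBoost.score;
  if (op === "*") return baseScore * staticBoost.value;
  if (op === "+") return baseScore + staticBoost.value;
  throw new Error("Unsupported staticBoost.score: " + op);
}

console.log(applyStaticBoost(2, { value: 1.5, score: "*" })); // 3
console.log(applyStaticBoost(2, { value: 1.5, score: "+" })); // 3.5
console.log(applyStaticBoost(2, { value: 1.5 }));             // 3
```

With multiplication, a title match scoring 2 ends up at 3 and outranks a description match that also scored 2, which is the tuning effect described above.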


Boosting using expressions

Cloud search supports boosting using C# expressions. We elected to use C# for boosting expressions because it turned out to be significantly faster than JavaScript on all benchmarks. This is crucial, as the boosting expression must in some cases be applied to millions of matches. You do not have to be a C# expert to use this feature. More often than not, boosting expressions are simple one-line expressions using the built-in C# Math library.

    // Syntax:
    Boost: {
        exp : <expression_to_execute>,
        field : <indexed_boosting_field>,
        cache : <cache_the_compiled_expression_for_reuse>,
        param : <pass_any_value_into_scoring_function>
    }

    // Example:
    // For the below example to work, the number_of_votes attribute must be indexed.
    // Only fields which are of type int, long, double, decimal, float, datetime, geopoint can be referenced in a
    // boosting expression
    db.books.find({
        $textx : {
            OR : "complete guide node js",
            Boost : {
                exp : "return current_score + Math.Log(1 + 0.1 * value.ToInt32())",
                field : "number_of_votes",
                cache : true
            }
        }
    })

    // Generated C# function for the above expression:
    public static double ExprBoost(double current_score, DBValue value, DBValue param)
    {
        return current_score + Math.Log(1 + 0.1 * value.ToInt32());
    }
Parameter - Description

current_score - The current score of the document, computed using the Nodechef cloud search practical scoring function, which uses tf/idf and a coordination factor. This function is similar to Lucene's practical scoring function.

value - Contains the value from the document referenced in the boost.field attribute in the search document. In the above example, value will contain the number_of_votes attribute from each document. Only indexed fields can be referenced. This is crucial for performance.

param - Allows you to pass a parameter into the scoring function. This parameter can be of type int32, int64, double, datetime, string and array.

cache - When set to true, caches the compiled expression for reuse on subsequent queries.
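To get a feel for how the example expression shapes the ranking, the sketch below evaluates the same formula in JavaScript for a few vote counts. It is only an illustration: in Cloud search the expression itself is written in C#, and the base score of 1.0 and the vote counts are made up.

```javascript
// JavaScript rendering of the example expression
//   current_score + Math.Log(1 + 0.1 * value.ToInt32())
// Math.log here, like C#'s single-argument Math.Log, is the natural log.
function boostByVotes(currentScore, numberOfVotes) {
  return currentScore + Math.log(1 + 0.1 * numberOfVotes);
}

for (const votes of [0, 10, 100, 1000]) {
  console.log(votes, boostByVotes(1.0, votes).toFixed(3));
}
// 0 1.000
// 10 1.693
// 100 3.398
// 1000 5.615
```

The logarithm dampens the effect of very large vote counts: going from 100 to 1000 votes adds roughly as much as going from 10 to 100.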

Methods of the DBValue struct

  • int ToInt32()
  • long ToInt64()
  • double ToDouble()
  • float ToFloat()
  • decimal ToDecimal()
  • datetime ToDateTime()
  • string ToString()
  • DBPoint ToDBPoint()
    DBPoint has the properties: GetX (x/latitude) and GetY (y/longitude).
  • List<DBValue> GetList()


Facets & SQL Aggregations

Facets are built into aggregation queries. Use a SQL SELECT with GROUP BY to retrieve facets.

    // A non-unique index is required on the GROUP BY field for best performance.
    db.runCommand({ select : "SELECT make, count(*) FROM docs WHERE type = 'suv' GROUP BY make" });

    // Retrieving multiple facets in a single command. The queries are executed in parallel.
    db.runCommand({
        multiSelect : 1,
        parallelExec : 1,
        statements : [
            "select make, count(*) from listings_collection group by make",
            "select model, count(*) from listings_collection group by model",
            "select fuel_type, count(*) from listings_collection group by fuel_type"
        ]
    })
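When the list of facets is driven by configuration, the multiSelect command can be generated rather than written by hand. The helper below is a hypothetical sketch that produces the same command shape as the example above. Because it concatenates strings into SQL, the field and collection names should come from a fixed list in your code and never from user input.

```javascript
// Hypothetical helper: build a multiSelect command that counts
// documents per distinct value for each facet field.
function facetCommand(collection, fields) {
  return {
    multiSelect: 1,
    parallelExec: 1,
    statements: fields.map(function (field) {
      return "select " + field + ", count(*) from " + collection +
        " group by " + field;
    })
  };
}

const cmd = facetCommand("listings_collection", ["make", "model", "fuel_type"]);
console.log(cmd.statements[0]);
// select make, count(*) from listings_collection group by make
```

The resulting object can be passed directly to db.runCommand.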