Boosting & Biasing

Using Domain Knowledge and
User Analytics in Apache Solr

Chris Hostetter

ApacheCon EU 2012




Solr Primer

Solr is a "search platform" that you can put documents into, and then query in a variety of ways to get lists of documents (and metadata about those lists of documents) back out via HTTP.

Solr Queries

q = +title:nightfall author:asimov^3

q = +nightfall asimov
defType = dismax
qf = title author^3
sort = user_ratings desc, score desc

qq = nightfall
q = {!boost b=$b defType=dismax v=$qq}
b = prod(popularity,awards)

The last example demonstrates using "Local Params" in a query string, in particular using variables to refer to other param values. Note also that defType is specified as a local param to control which parser is used when processing the nested query.
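As a rough illustration of how the three params in that last example travel together in one HTTP request, here is a minimal Python sketch. The endpoint URL and the helper name `boosted_query_url` are assumptions made for illustration; only the `qq`, `q`, and `b` values come from the slide.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical Solr endpoint; adjust host, port, and core path for your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def boosted_query_url(user_input: str) -> str:
    """Build a select URL carrying the params from the last example above."""
    params = {
        "qq": user_input,                           # raw user input, referenced as $qq
        "q": "{!boost b=$b defType=dismax v=$qq}",  # local params pick the parser and nested query
        "b": "prod(popularity,awards)",             # boost function, referenced as $b
    }
    # urlencode escapes the braces, dollar signs, and spaces for us.
    return SOLR_SELECT_URL + "?" + urlencode(params)

url = boosted_query_url("nightfall")
print(url)
```

Keeping the user input in its own param (`qq`) means it never has to be spliced into the local params string by hand.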

IR Primer: TF-IDF?

IR Primer: TF-IDF

TF-IDF(ish) Scoring

This is the default scoring model in Lucene/Solr (via DefaultSimilarity) but other scoring models (and other Similarity implementations) exist.

Thanks to Code Cogs for the equation SVG file (source)
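For readers without the equation image, here is a simplified Python sketch of the TF-IDF(ish) factors, assuming the classic DefaultSimilarity definitions (tf as the square root of the term count, idf as 1 + ln(numDocs / (docFreq + 1)), and length normalization as 1 / sqrt(numTerms)). It deliberately omits coord(), queryNorm(), index-time boosts, and the lossy norm encoding, so treat it as an approximation rather than what Lucene computes exactly. All function and variable names are illustrative.

```python
import math

def tf(freq: int) -> float:
    # Term frequency factor: grows with the square root of the raw count.
    return math.sqrt(freq)

def idf(doc_freq: int, num_docs: int) -> float:
    # Inverse document frequency: rarer terms contribute more.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms: int) -> float:
    # Shorter fields are rewarded over longer ones.
    return 1.0 / math.sqrt(num_terms)

def score(query_terms, doc_term_counts, doc_freqs, num_docs, boosts=None):
    """Simplified per-document score; omits coord() and queryNorm()."""
    boosts = boosts or {}
    field_len = sum(doc_term_counts.values())
    total = 0.0
    for term in query_terms:
        freq = doc_term_counts.get(term, 0)
        if freq == 0:
            continue  # term not in this document, contributes nothing
        total += (tf(freq)
                  * idf(doc_freqs[term], num_docs) ** 2
                  * boosts.get(term, 1.0)
                  * length_norm(field_len))
    return total

# Made-up statistics for a tiny example.
doc_freqs = {"nightfall": 2, "asimov": 10}
short_doc = {"nightfall": 1, "asimov": 1}
long_doc = {"nightfall": 1, "and": 1, "other": 1, "stories": 1}
print(score(["nightfall"], short_doc, doc_freqs, 1000))
print(score(["nightfall"], long_doc, doc_freqs, 1000))
```

With equal term counts, the shorter document scores higher purely because of length normalization.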


Domain Knowledge

Any specific rules about your data that wouldn't be suitable in a generic IR scoring algorithm.

In many domains of data, there are fundamental numeric properties that make some objects generally "better" than others.

You are the "Subject Matter Expert", not some soul-less algorithm.

Domain Knowledge Examples

More Subtle Examples

Popularity by association might involve computing a popularity for books that not only takes into account how often people buy that book, but also the overall sales rank of the author. Likewise, some "genres" (or categories) of books may be inherently more popular than others, so books in those genres might be ranked higher.

Examples of "manual rankings" might be a newspaper editor recognizing that a particular story is a big scoop, so she "boosts" its score for a few weeks; or a librarian might see that a certain author was recently in the news, and boost all books written by that author for a few days to help patrons find them.

Using A Sledge Hammer

Ignore Score, Sort on X
sort = pub_date desc q = AOL Time Warner
Filter by X, Retry if 0 Results.
fq = rev_date:[NOW-6MONTH TO NOW] q = PowerBook
Magic Keyword Stuffing
q = __zz_catDoesNotSuck^1000 +(Windows 2000)

These are all real examples of sledgehammers I was "asked" to use at various times in my career before I started working with Solr.

"red_sledgehammer" created by Neil Robinson, available under Public Domain.

In The Dark Ages...

Palm 3 Palm 3 Palm 3 Palm 3 Palm Inc Palm Inc __zz_price_500_700 __zz_rating_1 __zz_rating_2 ... __zz_rating_7 __zz_pop_90
q = +(Palm) __zz_rating_1 ... __zz_rating_10 __zz_pop_10 __zz_pop_10 ... (10 times) ... __zz_pop_90 __zz_pop_90 __zz_pop_100

Just to be crystal clear, even if you are reading these slides at a later date: this is an example of something I was forced to do in the dark ages before Solr existed -- there is no excuse for anyone to resort to ridiculous bullshit like this in "The Modern Era"

Palate Cleanser

This elegant palate cleansing fractal image is available from Wikimedia Commons under the Creative Commons licenses.

Work With The Algorithm

An example of using SweetSpotSimilarity -- the reason it was created in fact -- is dealing with noisy data in a product catalog. You can customize the length normalization factor to have a "sweet spot" based on how long you expect a typical product name to be, rewarding documents for having a name that is not too short and not too long. Likewise you can customize the tf() factor in the score to mitigate the effects of keyword stuffing.

Boost Functions and Queries

Apply domain knowledge based on numeric properties by multiplying functions directly into the score.

qq = the user input
q = {!boost b=$b defType=dismax v=$qq}
b = div($good,$bad)
good = prod(rating,map(recent_clicks,0,0,1))
bad = prod(price,cat_rank)

This example assumes that every document has a non-zero value for the rating, cat_rank, and price fields. The "map" function is used here to ensure that if a document has no recent_clicks we treat it the same as if it had 1. (The sum function could have worked just as well in this case.)

Boost functions can easily get complicated when dealing with fields that might have values <= 0. Remember: the more you can pre-compute in your index to simplify the query time calculations, the faster your searches will be.

Thanks to Code Cogs for the equation SVG file (source)
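To make the arithmetic of that boost concrete, here is a small Python sketch that mirrors it outside of Solr. The helper names `solr_map` and `boost` and the sample field values are made up for illustration; the sketch assumes that map(x,min,max,target) replaces any value inside the inclusive range with the target and leaves other values alone.

```python
def solr_map(x, lo, hi, target):
    # Mirrors map(x,min,max,target): values inside [lo, hi] become target.
    return target if lo <= x <= hi else x

def boost(rating, recent_clicks, price, cat_rank):
    # good = prod(rating, map(recent_clicks,0,0,1))
    good = rating * solr_map(recent_clicks, 0, 0, 1)
    # bad = prod(price, cat_rank); assumed non-zero, as the note above says
    bad = price * cat_rank
    # b = div($good, $bad), multiplied into the relevancy score by {!boost}
    return good / bad

print(boost(rating=4, recent_clicks=0, price=20, cat_rank=2))
print(boost(rating=4, recent_clicks=5, price=20, cat_rank=2))
```

A document with no recent clicks is treated exactly like one with a single click, while cheaper, better-ranked, higher-rated documents get a larger multiplier.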


<fieldType name="cat_ext" keyField="cat_id" class="solr.ExternalFileField"/>
...
<field name="cat_rank" type="cat_ext" />


Extremely narrow scalpel for dictating exactly where specific documents should rank for specific queries regardless of scores.

<elevate>
  <query text="iPhone">
    <doc id="IPHONE5" />
    <doc id="IPHONE4S" />
    <doc id="SAMSUNGGALS" exclude="true" />
  </query>
</elevate>

User Analytics
& Personalization

User Analytics

Collection, measurement, and analysis of user data to better understand your users -- either in the aggregate or individually.

My description/definition of "User Analytics" is based heavily on the definition of "Web analytics" from Wikipedia.

Cheap Metrics / Personalization

Personalization Sledge Hammer

Default Sort
sort = last_sort_field desc
Default Filters
fq = category:last_clicked

Personalize With Functions

qq = current search input
q = {!boost b=$b v=$qq}
b = prod(query("previous search input"), query(category:last_clicked), div(1, last_desc_sort))

A big caveat to remember in any discussion about any technique of personalizing search result scores is that it can have some fairly significant impacts on the performance of searches -- particularly in terms of how effective caching can be since every personalized query may be completely unique.

"Bucketing" your users can help mitigate some of the performance trade-off, particularly if a large percentage of your users are completely anonymous and always fall in a single bucket.

Parametrized Personalization

Sam's recent search activity has heavily focused on Laptops, with filter/sort choices indicating that low price is a key differentiator.

q = {!boost b=$b v=$qq}
b = pow(prod(sum(1, query($pref)), $diff), $conf)
pref = category:laptop
diff = div(1, price)
conf = 2.3

Parametrized Personalization

Sally's recent search activity has been less focused, but has mainly involved Apple products. Her filter/sort choices indicate that she is primarily concerned with user ratings.

q = {!boost b=$b v=$qq}
b = pow(prod(sum(1, query($pref)), $diff), $conf)
pref = mfg:Apple
diff = avg_user_ratings
conf = 0.6

Note in particular that the "personalization function" in these last two examples is fixed (and could be hard coded in our solrconfig.xml). It's the inputs to this function that vary per request, based on what we know about each user. Default values can even be configured to eliminate the personalization function from the equation when no information about the current user is available (e.g. conf=0).

In my experience, even inputs like pref and diff typically come from fixed sets of possible outputs from the available analytics, but the confidence (conf) in those values would be extremely variable.
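The effect of the conf param can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming query($pref) yields 0 for documents that do not match the preference query; the function name `personalization_boost` and the sample scores and prices are invented for illustration.

```python
def personalization_boost(pref_score, diff, conf):
    # b = pow(prod(sum(1, query($pref)), $diff), $conf)
    # pref_score stands in for query($pref): assumed 0 when the doc does not match.
    return ((1.0 + pref_score) * diff) ** conf

# Sam: pref = category:laptop, diff = div(1, price), conf = 2.3
cheap_laptop = personalization_boost(pref_score=1.0, diff=1 / 500, conf=2.3)
pricey_laptop = personalization_boost(pref_score=1.0, diff=1 / 1500, conf=2.3)
cheap_non_laptop = personalization_boost(pref_score=0.0, diff=1 / 500, conf=2.3)
print(cheap_laptop, pricey_laptop, cheap_non_laptop)

# Anonymous user: conf = 0 turns the whole function into a constant 1.
print(personalization_boost(pref_score=0.0, diff=1 / 500, conf=0))
```

Raising a positive value to the power 0 always gives 1, which is why conf=0 leaves the original score untouched.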

Wanna Get Freaky?

Sweet Spot Plateaus

This idea is inspired by the length normalization function of SweetSpotSimilarity.

Thanks to Code Cogs for the equation SVG file (source)

Graph generated using this gnuplot file.

Example Plateau

Steve recently clicked on the $1000-1200 price constraint

q = {!boost b=$b v=$qq}
b = div(1,sqrt(sum(1,$mult)))
mult = prod(0.001,sub($range, sub($max,$min)))
range = sum(abs(sub($bias,$min)), abs(sub($bias,$max)))
bias = price
min = 1000
max = 1200
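A quick way to see the plateau is to evaluate the same function outside of Solr. The Python sketch below follows the params of the example term by term; the function name `plateau_boost` and the sample prices are illustrative only.

```python
import math

def plateau_boost(bias, lo, hi, steepness=0.001):
    # range = sum(abs(sub($bias,$min)), abs(sub($bias,$max)))
    rng = abs(bias - lo) + abs(bias - hi)
    # mult = prod(0.001, sub($range, sub($max,$min)))
    # Inside [lo, hi] the range equals hi - lo, so mult is exactly 0.
    mult = steepness * (rng - (hi - lo))
    # b = div(1, sqrt(sum(1, $mult)))
    return 1.0 / math.sqrt(1.0 + mult)

for price in (500, 700, 1000, 1100, 1200, 1500):
    print(price, round(plateau_boost(price, 1000, 1200), 4))
```

Every price between 1000 and 1200 gets a boost of exactly 1, and the boost falls off symmetrically the further a price sits outside that range. The 0.001 factor controls how steep the fall-off is.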

In Conclusion

Key Take Aways