You are currently viewing the "Outline" mode of a pure HTML presentation developed using S5. This presentation can be downloaded in its entirety as a single zip file.
To view the presentation as a slide show, you can click the ± in the upper right-hand corner of the page. To control the slide show, mouse over the lower section of the screen to make the HUD controls visible, or hit space to advance the slides.
This help text will be hidden when printing this presentation.
Solr is "Search platform" that you can put documents into, and can then query in a variety of to get lists documents (and metadata about those lists of documents) back out via HTTP.
The last example demonstrates using "Local Params" in a query string, in particular using variables to refer to other param values. Note also that defType is specified as a local param to control which parser is used when processing the nested query.
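A query string using those features might look something like this (field names and boosts are made up for illustration): the $uq variable pulls the user's raw query text from another parameter, and defType inside the local params tells Solr which parser to apply to that nested query.

    q={!query defType=dismax v=$uq}
    &uq=apple ipod
    &qf=name^3 features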
TF-IDF(ish) Scoring
This is the default scoring model in Lucene/Solr (via DefaultSimilarity), but other scoring models (and other Similarity implementations) exist.
Thanks to Code Cogs for the equation SVG file (source)
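As a rough reminder of the shape of that formula (paraphrasing the Lucene javadocs for DefaultSimilarity, so treat the details as approximate):

    score(q,d) = coord(q,d) * queryNorm(q)
                 * SUM over each term t in q of:
                     tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d)

    where tf(t in d) = sqrt(frequency of t in d)
      and idf(t)     = 1 + log( numDocs / (docFreq(t) + 1) )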
Any specific rules about your data that wouldn't be suitable in a generic IR scoring algorithm.
In many domains of data, there are fundamental numeric properties that make some objects generally "better" than others.
You are the "Subject Matter Expert" not some soul-less algorithm.
Popularity by association might involve computing a popularity score for books that takes into account not only how often people buy that book, but also the overall sales rank of the author. Likewise, some "genres" (or categories) of books may be inherently more popular than others, so books in those genres might be ranked higher.
Examples of "manual rankings" might be a Newspaper editor recognizing that a particular story is a big scoop, so she "boosts" its scores for a few weeks; Or a librarian might see that a certain author was recently in the news, and boost all books written by that author for a few days to help patrons find them.
These are all real examples of sledgehammers I was "asked" to use at various times in my career before I started working with Solr.
"red_sledgehammer" created by Neil Robinson, available under Public Domain.
Just to be crystal clear, even if you are reading these slides at a later date: this is an example of something I was forced to do in the dark ages before Solr existed -- there is no excuse for anyone to resort to ridiculous bullshit like this in "The Modern Era"
This elegant palate-cleansing fractal image is available from Wikimedia Commons under a Creative Commons license.
An example of using SweetSpotSimilarity -- the reason it was created in fact -- is dealing with noisy data in a product catalog. You can customize the length normalization factor to have a "sweet spot" based on how long you expect a typical product name to be, rewarding documents for having a name that is not too short and not too long. Likewise, you can customize the tf() factor in the score to mitigate the effects of keyword stuffing.
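For reference, the length normalization piece of SweetSpotSimilarity looks roughly like this (paraphrasing its javadocs):

    lengthNorm(numTerms) = 1 / sqrt( steepness * ( |numTerms - min| + |numTerms - max| - (max - min) ) + 1 )

Any field length between min and max gets the full normalization value of 1, and lengths outside that sweet spot are penalized at a rate controlled by steepness.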
Apply domain knowledge based on numeric properties by multiplying functions directly into the score.
This example assumes that every document has a non-zero value for the rating, cat_rank, and price fields. The "map" function is used here to ensure that if a document has no recent_clicks we treat it the same as if it had 1. (The sum function could have worked just as well in this case)
Boost functions can easily get complicated when dealing with fields that might have values <= 0. Remember: the more you can pre-compute in your index to simplify the query-time calculations, the faster your searches will be.
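As a sketch of how the fields above might be multiplied into the score using the edismax boost param (the query, constants, and choice of functions are illustrative only):

    q=laptop
    &defType=edismax
    &boost=product(rating,cat_rank,map(recent_clicks,0,0,1))
    &boost=recip(price,1,1000,1000)

Each boost function is multiplied directly into the relevancy score; wrapping price in recip keeps that factor bounded between 0 and 1, so a handful of very cheap documents can't swamp everything else.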
Thanks to Code Cogs for the equation SVG file (source)
Extremely narrow scalpel for dictating exactly where specific documents should rank for specific queries regardless of scores.
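In Solr, this scalpel is typically the QueryElevationComponent, driven by an elevate.xml file along these lines (the query text and document ids are made up):

    <elevate>
      <query text="ipad">
        <doc id="SKU1234" />
        <doc id="SKU9999" exclude="true" />
      </query>
    </elevate>

For the query "ipad", SKU1234 is pinned to the top of the results and SKU9999 is removed entirely, no matter what their scores would have been.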
Collection, measurement, and analysis of user data to better understand them -- either in the aggregate or individually.
My description/definition of "User Analytics" is based heavily on the definition of "Web analytics" from Wikipedia.
"red_sledgehammer" created by Neil Robinson, available under Public Domain.
A big caveat to remember in any discussion about any technique of personalizing search result scores is that it can have some fairly significant impacts on the performance of searches -- particularly in terms of how effective caching can be since every personalized query may be completely unique.
"Bucketing" your users can help mitigate some of the performance trade off, particularly if a large percentage of your users are completely anonymous and always fall in a single bucket.
Sam's recent search activity has heavily focused on Laptops, with filter/sort choices indicating that low price is a key differentiator.
Sally's recent search activity has been less focused, but has mainly involved Apple products. Her filter/sort choices indicate that she is primarily concerned with user ratings.
Note in particular that the "personalization function" in these last two examples is fixed (and could be hard coded in our solrconfig.xml). It's the input to these functions that will vary per-request based on what we know about each user. Default values can even be configured to eliminate the personalization function from the equation when no information about the current user is available (e.g. conf=0)
In my experience, even inputs like pref and diff typically come from fixed sets of possible outputs from the available analytics, but the confidence (conf) in those values would be extremely variable.
This idea is inspired by the length normalization function of SweetSpotSimilarity.
Thanks to Code Cogs for the equation SVG file (source)
Graph generated using this gnuplot file.
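One way to wire up that kind of fixed-function / variable-input setup is to hard code a boost function in the edismax defaults of solrconfig.xml and let each request supply conf, pref, and diff. The sketch below adapts the sweet-spot shape to a price field; it is a guess at the general idea, not the exact function from the slides, and the whitespace is only there for readability:

    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="boost">
        div(1,sqrt(sum(1,product($conf,
            sub(sum(abs(sub(price,sub($pref,$diff))),
                    abs(sub(price,sum($pref,$diff)))),
                product(2,$diff))))))
      </str>
      <str name="conf">0</str>
      <str name="pref">0</str>
      <str name="diff">0</str>
    </lst>

A request can then pass something like &conf=0.8&pref=1100&diff=100 based on what the analytics say about the current user: documents priced within diff of pref get the full boost of 1, prices further away are penalized, and for anonymous users the default conf=0 collapses the whole expression to 1 so their scores are untouched.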
Steve recently clicked on the $1000-1200 price constraint