This document is designed to be viewed as a reveal.js slide show presentation. The contents of the presentation (and the speaker notes) can also be read or printed below.

A ZIP file containing this presentation is available for offline perusal.

There is also a video of this talk available online.

Hidden Gems

Getting More Out Of Apache Solr

Berlin Buzzwords - 2014-05-27


https://people.apache.org/~hossman/bbuzz2014/

https://twitter.com/_hossman/

http://www.lucidworks.com/


Monitoring

Admin UI

Solr Admin UI Screenshot: filterCache Stats

JMX

JConsole Screenshot: filterCache Stats

HTTP Admin APIs

{
  "filterCache":{
    "stats":{
      "lookups":27,
      "hits":23,
      "hitratio":0.85,
      "inserts":4,
      "evictions":0,
      "size":4,
      "warmupTime":0,
      "cumulative_lookups":31,
      "cumulative_hits":25,
      "cumulative_hitratio":0.81,
      "cumulative_inserts":6,
      "cumulative_evictions":0}},
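
As a quick sanity check on the stats above, the hit ratios can be recomputed from the raw counters. This is only a sketch of the arithmetic: the numbers are copied from the fragment shown, and the `hit_ratio` helper is a name made up for illustration, not part of Solr.

```python
# Counters copied from the filterCache stats fragment above.
stats = {
    "lookups": 27,
    "hits": 23,
    "cumulative_lookups": 31,
    "cumulative_hits": 25,
}


def hit_ratio(hits, lookups):
    """Fraction of lookups answered from the cache (0.0 if there were none)."""
    return hits / lookups if lookups else 0.0


current = hit_ratio(stats["hits"], stats["lookups"])
cumulative = hit_ratio(stats["cumulative_hits"], stats["cumulative_lookups"])

print(round(current, 2), round(cumulative, 2))  # 0.85 0.81
```

The rounded values agree with the `hitratio` and `cumulative_hitratio` fields reported by Solr.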

facet.method

fc vs. fcs

  • Both iterate over the matching documents and increment counters per-term
  • fc (Default)
    • Single FieldCache (or UnInvertedField) over entire index
    • Typically faster look-ups than fcs once un-inverted structure is built -- no per-query merging needed
  • fcs
    • FieldCache per index segment
    • Typically faster to re-build than fc in NRT situations -- only modified segments need to be rebuilt
    • Per-segment FieldCache is also used in sorting -- re-use in faceting may reduce total heap usage.

enum

  • Enumerates all terms in the field and computes a set intersection with the matching documents
  • Leverages the filterCache
  • Small Cardinality Fields
    • Cached document sets may use less RAM than FieldCache
    • fq constraints on the same field will re-use the cached document sets
  • High Cardinality Fields
    • FieldCache / UnInvertedField may not fit in RAM
    • Slower enum can still be used to get counts
    • Use facet.enum.cache.minDf to minimize filterCache churn
    • Can be used for faceting on full-text fields to build tag clouds
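
The method can be chosen per field with the `f.<field>.facet.method` override. The sketch below builds such a request. The field names `category` and `author`, the `minDf` value of 25, and the `facet_params` helper are assumptions made for illustration.

```python
from urllib.parse import urlencode


def facet_params(methods, min_df=None):
    """Build facet params with a per-field facet.method override.

    methods maps a field name to 'fc', 'fcs' or 'enum'.
    """
    params = [("facet", "true")]
    for field, method in methods.items():
        if method not in ("fc", "fcs", "enum"):
            raise ValueError("unknown facet.method: " + method)
        params.append(("facet.field", field))
        params.append(("f." + field + ".facet.method", method))
    if min_df is not None:
        # Terms with a lower document frequency skip the filterCache.
        params.append(("facet.enum.cache.minDf", str(min_df)))
    return params


# Hypothetical fields: enum for one field, fc for the other.
params = facet_params({"category": "enum", "author": "fc"}, min_df=25)
query_string = urlencode(params)
print(query_string)
```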

Result Clustering

Voronoi treemap visualization from FoamTree

Clustering Component

{
  "clusters":[{
      "labels":["Environmental"],
      "score":6.393107732455205,
      "docs":["9781901362930", "9781841130903",
              "9781841130897", "9781841133607"]},
    {
      "labels":["Human Rights"],
      "score":15.667620783438327,
      "docs":["9781841130354", "9781841134574",
              "9781841136530"]},
    {
      "labels":["Anatomy of Tort Law"],
      "score":11.181459329239996,
      "docs":["9781901362091", "9781901362084"]},
    {
      "labels":["Litigation"],
      "score":8.560711128059928,
      "docs":["9781841132983", "9781841134574"]},
  ...

Function Boosting & Personalized Scoring

Basic Function Boosting

      q = Nightfall Isaac Asimov
defType = edismax
  boost = div( popularity, add(1,price) )

      q = {!boost b=$my_func v=$qq}
     qq = +title:Nightfall author:"Isaac Asimov"
my_func = div( popularity, add(1,price) )
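
To get a feel for what the function does, here is the same arithmetic outside Solr. The sample popularity and price values are invented. The result is multiplied into each document's relevance score.

```python
def boost(popularity, price):
    """Mirror of div(popularity, add(1, price)).

    Adding 1 to the price keeps the division defined for free items.
    """
    return popularity / (1 + price)


# Invented (popularity, price) pairs, for illustration only.
samples = [(10, 4), (10, 9), (10, 0)]
for popularity, price in samples:
    print(popularity, price, boost(popularity, price))
```

Popular, cheap documents get the largest multiplier. A price of 0 leaves the popularity unchanged.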

Custom Category Boosts Per User

  • Accumulate data on how much each of your users likes/dislikes various categories
  • Batch process for every user:
    • A normalized "Z-Score" preference for each category
    • Record the 3 most significant (i.e., greatest absolute value Z-Score) categories
  • At query time:
    • Look-up the user's 3 most significant category scores
    • Use the Z-Scores as exponents in a boost function over those category queries
    qq = ...search terms...
     q = {!boost b=$b v=$qq}
     b = prod(pow( query($cat1), $z_cat1),
              pow( query($cat2), $z_cat2),
              pow( query($cat3), $z_cat3))
  cat1 = category:action                    # The user's 3 most significant categories,
z_cat1 = 1.48                               # ... and their Z-scores
  cat2 = category:comedy
z_cat2 = 1.33
  cat3 = category:kids
z_cat3 = -1.7
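
The batch step could look roughly like the sketch below. The slides do not say what the Z-Scores are normalized against. This version assumes each user's raw category scores are normalized against that same user's mean and standard deviation. The raw scores and helper names are invented.

```python
from statistics import mean, pstdev


def top_z_scores(prefs, n=3):
    """Return the n (category, z) pairs with the greatest absolute Z-Score."""
    values = list(prefs.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # no preference signal at all
    z = {cat: (v - mu) / sigma for cat, v in prefs.items()}
    ranked = sorted(z.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:n]


def boost_params(top):
    """Turn the top categories into the request params used above."""
    params, terms = {}, []
    for i, (cat, z) in enumerate(top, start=1):
        params["cat%d" % i] = "category:" + cat
        params["z_cat%d" % i] = "%.2f" % z
        terms.append("pow(query($cat%d),$z_cat%d)" % (i, i))
    params["b"] = "prod(" + ",".join(terms) + ")"
    return params


# Invented raw preference scores for one user.
prefs = {"action": 9, "comedy": 8, "drama": 5, "horror": 4, "kids": 1}
top = top_z_scores(prefs)
params = boost_params(top)
print(params)
```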

Defaults, Appends, Invariants, ... Oh My!

Lots of Options = Long URLs ?

http://server:8983/solr/collection_name/select?defType=edismax
  &qf=title^4+authors^3+description&pf2=title,author
  &boost=div(popularity,add(1,price))&sort=score+desc,+price+desc
  &fl=id,title,description,price&fq=instock:true
  &rows=100&start=0&q=Nightfall+Isaac+Asimov

Lots of Options ≠ Long URLs !

http://server:8983/solr/collection_name/select?q=Nightfall+Isaac+Asimov
<lst name="defaults">
  <str name="defType">edismax</str>
  <str name="qf">title^4 authors^3 description</str>
  <str name="pf2">title, author</str>
  <str name="boost">div(popularity,add(1,price))</str>
  <str name="sort">score desc, price desc</str>
  <str name="fl">id,title,description,price</str>
  <str name="fq">instock:true</str>
  <int name="rows">100</int>
  <int name="start">0</int>
</lst>

Prevent Client Mistakes

http://server:8983/solr/collection_name/select?q=Nightfall+Isaac+Asimov
<lst name="invariants">
  <str name="defType">edismax</str>
  <str name="qf">title^4 authors^3 description</str>
  <str name="pf2">title, author</str>
  <str name="boost">div(popularity,add(1,price))</str>
  <str name="sort">score desc, price desc</str>
  <int name="rows">100</int>
</lst>
<lst name="appends">
  <str name="fq">instock:true</str>
</lst>
<lst name="defaults">
  <str name="fl">id,title,description,price</str>
  <int name="start">0</int>
</lst>
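
The three sections combine roughly as modelled below. This is a simplified sketch, not Solr's implementation: defaults apply only when the client sends nothing, appends are added to whatever the client sends, and invariants replace the client's values. The over-eager client request is invented.

```python
def effective_params(request, defaults, appends, invariants):
    """Simplified model of how a handler merges its configured params."""
    merged = {name: list(values) for name, values in defaults.items()}
    for name, values in request.items():
        merged[name] = list(values)  # client values replace defaults
    for name, values in appends.items():
        merged[name] = merged.get(name, []) + list(values)  # always added
    for name, values in invariants.items():
        merged[name] = list(values)  # client cannot override these
    return merged


defaults = {"fl": ["id,title,description,price"], "start": ["0"]}
appends = {"fq": ["instock:true"]}
invariants = {"defType": ["edismax"], "rows": ["100"]}

# Invented client request that tries to fetch far too many rows.
request = {
    "q": ["Nightfall Isaac Asimov"],
    "rows": ["100000"],
    "fq": ["category:books"],
    "fl": ["id"],
}
result = effective_params(request, defaults, appends, invariants)
print(result)
```

In this model the client's `rows` is ignored, its `fq` is kept alongside `instock:true`, and its `fl` wins over the default.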

Hide Implementation Details

/select?shipping=free_to_members&cat=books&q=Nightfall+Isaac+Asimov
<lst name="invariants">
  <str name="cat_filter">{!term f=category v=$cat}</str>
</lst>
<lst name="appends">
  <str name="fq">instock:true</str>
  <str name="fq">{!switch case.any='*:*'
                          default=$cat_filter
                          v=$cat}
  </str>
  <str name="fq">{!switch case.any='*:*'
                          case.free_to_members='member_shipping:0.0'
                          case.free='shipping_cost:0.0'
                          v=$shipping}
  </str>
</lst>
<lst name="defaults">
  <str name="cat">any</str>
  <str name="shipping">any</str>
</lst>
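
The switch parser's selection logic can be modelled in a few lines. This is a rough sketch under assumptions, not the parser itself: `resolve_switch` is a made-up helper, and a plain dict stands in for the `case.*` local params.

```python
def resolve_switch(value, cases, default=None):
    """Pick the query for case.<value>, else the default, else fail."""
    key = (value or "").strip()
    if key in cases:
        return cases[key]
    if default is not None:
        return default
    raise ValueError("no switch case matched: " + repr(key))


shipping_cases = {
    "any": "*:*",
    "free_to_members": "member_shipping:0.0",
    "free": "shipping_cost:0.0",
}
cat_cases = {"any": "*:*"}
cat_filter = "{!term f=category v=$cat}"

print(resolve_switch("free_to_members", shipping_cases))
print(resolve_switch("books", cat_cases, default=cat_filter))
print(resolve_switch("any", cat_cases, default=cat_filter))
```

Because the shipping filter has no default, an unrecognized `shipping` value is rejected in this model instead of silently matching everything.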

Query Parsers

The Usual Suspects

  • lucene - Canonical query syntax
  • dismax - Simplified syntax building Disjunction across configured fields
  • edismax - Hybrid: full lucene syntax with configured disjunction field aliases

Specialized Parsers

  • bbox
  • boost
  • child
  • collapse
  • complex
  • field
  • frange
  • func
  • geofilt
  • join
  • maxscore
  • parent
  • prefix
  • query
  • raw
  • simple
  • surround
  • switch
  • term

Hierarchical Documents

(aka: Block Join)

Nested Documents

<add>
  <doc>
    <field name="id">100</field>
    <field name="doctype">album</field>
    <field name="album_name">Wayne's World (soundtrack)</field>
    <doc>
      <field name="id">101</field>
      <field name="doctype">song</field>
      <field name="song_name">Bohemian Rhapsody</field>
      <field name="artist_name">Queen</field>
    </doc>
    <doc>
      <field name="id">102</field>
      <field name="doctype">song</field>
      <field name="song_name">Hot and Bothered</field>
      <field name="artist_name">Cinderella</field>
    </doc>
    ...
  </doc>
  ...
</add>
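
A block like the one above can be generated with a few lines of client code. The sketch below uses only the Python standard library and the field names from the example. `make_doc` is a hypothetical helper, and posting the payload to Solr is left out.

```python
import xml.etree.ElementTree as ET


def make_doc(fields, children=()):
    """Build a <doc> element, nesting any child <doc> elements inside it."""
    doc = ET.Element("doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    for child in children:
        doc.append(child)
    return doc


songs = [
    make_doc({"id": 101, "doctype": "song",
              "song_name": "Bohemian Rhapsody", "artist_name": "Queen"}),
    make_doc({"id": 102, "doctype": "song",
              "song_name": "Hot and Bothered", "artist_name": "Cinderella"}),
]
album = make_doc({"id": 100, "doctype": "album",
                  "album_name": "Wayne's World (soundtrack)"}, songs)

add = ET.Element("add")
add.append(album)
payload = ET.tostring(add, encoding="unicode")
print(payload)
```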

"Soundtrack" Albums

/select?q=soundtrack&fq=doctype:album

{ "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"100",
        "album_name":"Wayne's World (soundtrack)"},
      {
        "id":"200",
        "album_name":"Empire Records (Soundtrack)"},
      {
        "id":"300",
        "album_name":"Reality Bites (Soundtrack)"}]
  }}

"Love" Songs

/select?q=love&fq=doctype:song

{ "response":{"numFound":7,"start":0,"docs":[
      { "id":"114",
        "song_name":"Loud Love",
        "artist_name":"Soundgarden"},
      { "id":"112",
        "song_name":"Loving Your Lovin'",
        "artist_name":"Eric Clapton"},
      { "id":"406",
        "song_name":"One Year of Love",
        "artist_name":"Queen"},
      { "id":"503",
        "song_name":"Ready For Love",
        "artist_name":"Bad Company"},
      { "id":"532",
        "song_name":"Hammer of Love",
        "artist_name":"Bad Company"},
      ...,

"Love" Songs on Soundtracks

/select?q=love&fq={!child of="doctype:album"}soundtrack

{ "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"114",
        "song_name":"Loud Love",
        "artist_name":"Soundgarden"},
      {
        "id":"112",
        "song_name":"Loving Your Lovin'",
        "artist_name":"Eric Clapton"},
      {
        "id":"314",
        "song_name":"Baby, I Love Your Way",
        "artist_name":"Big Mountain"}]
  }}

Soundtracks containing "Love" Songs

/select?q=soundtrack&fq={!parent which="doctype:album"}love

{ "response":{"numFound":2,"start":0,"docs":[
      {
        "id":"100",
        "album_name":"Wayne's World (soundtrack)"},
      {
        "id":"300",
        "album_name":"Reality Bites (Soundtrack)"}]
  }}

Soundtracks with Both...

  • A Song by Alice Cooper
  • A "Love" Song by Soundgarden
 q = soundtrack
fq = {!parent which="doctype:album"}artist_name:"Alice Cooper"
fq = {!parent which="doctype:album"}(artist_name:Soundgarden AND song_name:Love)
{ "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"100",
        "album_name":"Wayne's World (soundtrack)"}]
  }}
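
When these filters are sent over HTTP, the local-params syntax (braces, quotes, spaces) has to be URL-encoded. Below is a small sketch using the Python standard library. The `/select` path is taken from the earlier slides, and everything else is a plain encoding round trip.

```python
from urllib.parse import urlencode, parse_qs

params = [
    ("q", "soundtrack"),
    ("fq", '{!parent which="doctype:album"}artist_name:"Alice Cooper"'),
    ("fq", '{!parent which="doctype:album"}'
           '(artist_name:Soundgarden AND song_name:Love)'),
]

query_string = urlencode(params)  # repeated fq params are preserved
url = "/select?" + query_string
print(url)

# Decoding the query string gives back the original filter strings.
decoded = parse_qs(query_string)
print(decoded["fq"])
```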

Block Join Caveats

  • Fairly new feature, still evolving (SOLR-5142)
  • Currently only supported as constant score queries (SOLR-5882)
  • Special _root_ field needed to handle deletes when updating a block:
    • Currently has a bug if you "update" a parent document to have no children (SOLR-5211)
    • Doesn't play nicely with deleting by id -- need to use delete by query to ensure all children are removed
  • Coming Soon: Option to include nested child documents in search results (SOLR-5285)

Update Processors

Pipeline Of Reusable Tools

<processor class="solr.CloneFieldUpdateProcessorFactory">
  <arr name="source">
    <str>authors</str>
    <str>editors</str>
  </arr>
  <str name="dest">contributors</str>
</processor>
<processor class="solr.ConcatFieldUpdateProcessorFactory">
  <str name="delimiter">; </str>
  <str name="fieldName">contributors</str>
</processor>
<processor class="solr.CloneFieldUpdateProcessorFactory">
  <str name="source">authors</str>
  <str name="dest">primary_author</str>
</processor>
<processor class="solr.FirstFieldValueUpdateProcessorFactory">
  <str name="fieldName">primary_author</str>
</processor>
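
The effect of this chain on a single document can be modelled as below. It is a simplified sketch that ignores schema details and treats every field as a list of values. The helper names and the sample editor are invented.

```python
def clone_field(doc, sources, dest):
    """Copy all values of the source fields into dest."""
    values = [v for src in sources for v in doc.get(src, [])]
    if values:
        doc.setdefault(dest, []).extend(values)


def concat_field(doc, field, delimiter):
    """Join a multi-valued field into one delimited string."""
    if doc.get(field):
        doc[field] = [delimiter.join(doc[field])]


def first_value(doc, field):
    """Keep only the first value of a field."""
    if doc.get(field):
        doc[field] = doc[field][:1]


# Invented input document.
doc = {
    "authors": ["Isaac Asimov", "Robert Silverberg"],
    "editors": ["A. N. Editor"],
}

# Same order as the processors configured above.
clone_field(doc, ["authors", "editors"], "contributors")
concat_field(doc, "contributors", "; ")
clone_field(doc, ["authors"], "primary_author")
first_value(doc, "primary_author")
print(doc)
```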

Script Your Own

<processor class="solr.StatelessScriptUpdateProcessorFactory">
  <str name="script">update-script.js</str>
  <lst name="params">
    <int name="min_popularity">42</int>
  </lst>
</processor>
// in update-script.js
function processAdd(cmd) {
  var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument

  if (params.get("min_popularity") < doc.getFieldValue("popularity")) {
    doc.addField("is_hot","true");
  }
}

Q & A