The Many Facets Of Apache Solr

Chris Hostetter


2010-11-05


http://people.apache.org/~hossman/apachecon2010/

http://lucene.apache.org/solr/

Why Are We Here?

What is Solr?

Elevator Pitch

Solr is a highly scalable open source enterprise search server based on the Lucene Java search library, with HTTP APIs, caching, replication, and a web administration interface.

Solr In A Nutshell

What is Faceted Searching?

Example: CNET

Example: The Smithsonian

Example: Lucid Imagination

Aka: “Faceted Browsing”

Interaction style where users filter a set of items by progressively selecting from only valid values of a faceted classification system.
Keith Instone, SOASIS&T, July 8, 2004

Key Elements of Faceted Search

Explaining My Terms

http://facetmap.com/glossary/ is the source of the quote in my Facet definition.

They have a different term for "Constraint" which I don't like as much.

Solr Facets

Preliminaries: Solr Params

For legibility, all examples of SolrParams in the remainder of this presentation will be non-url escaped, with one param per line.
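When these params are actually sent to Solr they do need to be URL escaped. A minimal sketch in Python, assuming only the standard library; the param list is a hypothetical request, and repeated names such as facet.query are kept as separate pairs:

```python
from urllib.parse import urlencode, parse_qsl

# Hypothetical request params, written the way the slides show them:
# one (name, value) pair per param, repeated names allowed.
params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.query", "rank:[* TO 20]"),
    ("facet.query", "rank:[21 TO *]"),
]

def to_query_string(pairs):
    """URL-escape the params so they can be appended to a request URL."""
    return urlencode(pairs)

query_string = to_query_string(params)
print(query_string)

# Decoding the escaped string gives back the legible form used in the slides.
print(parse_qsl(query_string))
```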

facet=true

facet.query=...

facet.query = rank:[* TO 20]
facet.query = rank:[21 TO *]

facet.query Results

<result numFound="27" ... />
...
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="rank:[* TO 20]">2</int>
    <int name="rank:[21 TO *]">15</int>
  </lst>
  ...

facet.field=...

facet.field = color
facet.field = category

facet.field Results

<lst name="facet_counts">
  ...
  <lst name="facet_fields">
    <lst name="color">
      <int name="red">17</int>
      <int name="green">6</int>
      <int name="blue">2</int>
      <int name="yellow">2</int>
      <int name="fuscia">0</int>
      <int name="teal">0</int>
      ...
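A presentation layer usually needs these counts as plain data. A minimal sketch in Python, assuming the standard library's ElementTree and a small well-formed stand-in for the abbreviated response above (the SAMPLE document and the facet_field_counts helper are illustrative, not part of Solr):

```python
import xml.etree.ElementTree as ET

# A well-formed stand-in for the abbreviated response on the slide.
SAMPLE = """
<response>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="color">
        <int name="red">17</int>
        <int name="green">6</int>
        <int name="blue">2</int>
        <int name="fuscia">0</int>
      </lst>
    </lst>
  </lst>
</response>
"""

def facet_field_counts(xml_text, field):
    """Return [(term, count), ...] for one facet.field, in response order."""
    root = ET.fromstring(xml_text)
    path = (".//lst[@name='facet_counts']/lst[@name='facet_fields']"
            f"/lst[@name='{field}']/int")
    return [(node.get("name"), int(node.text)) for node in root.findall(path)]

counts = facet_field_counts(SAMPLE, "color")
print(counts)
```

Keeping the response order matters because the Constraints arrive already sorted for display.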

facet.field Options

facet.date=...

facet.date = pubdate
facet.date.start = NOW/YEAR-1YEAR
facet.date.end = NOW/MONTH+1MONTH
facet.date.gap = +1MONTH

facet.date Results

<lst name="facet_counts">
  ...
  <lst name="facet_dates">
    <lst name="pubdate">
      <int name="2009-01-01T00:00:00Z">4</int>
      <int name="2009-02-01T00:00:00Z">6</int>
      <int name="2009-03-01T00:00:00Z">0</int>
      <int name="2009-04-01T00:00:00Z">13</int>
      ...
      <int name="gap">+1MONTH</int>
      <date name="end">2010-12-01T00:00:00Z</date>

facet.date Options

Tips & Tricks

"Pretty" facet.field Terms

<tokenizer class="solr.PatternTokenizerFactory" pattern="(,|;)\s*" />
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " />
<filter class="solr.PatternReplaceFilterFactory" pattern=" and " replacement=" &amp; " />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.CapitalizationFilterFactory" onlyFirstWord="false" />

If you can use unique identifiers (instead of pretty, long, string labels) it can reduce memory usage and simplify the request parsing (see next Tip) but it adds work to your front end application -- keeping track of the mapping between id => pretty label. If your application already needs to know about these mappings for other purposes, then it's much simpler to take advantage of that.

"Pretty" facet.field Results

<field name="category"> books and magazines; computers, </field>
<lst name="category">
  <int name="Books &amp; Magazines">1</int>
  <int name="Computers">1</int>
</lst>
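To see why that input produces those two Terms, the analysis chain can be approximated outside of Solr. A rough sketch in Python using plain regular expressions; it only imitates the tokenizer and filters configured above (it is not how Solr implements them), and it assumes empty tokens are dropped:

```python
import re

def pretty_terms(raw_value):
    """Approximate the analysis chain above with plain regular expressions."""
    terms = []
    # PatternTokenizerFactory: split on "," or ";" plus trailing whitespace.
    for token in re.split(r"(?:,|;)\s*", raw_value):
        token = re.sub(r"\s+", " ", token)      # collapse whitespace runs
        token = token.replace(" and ", " & ")   # " and " -> " & "
        token = token.strip()                   # TrimFilterFactory
        if not token:
            continue                            # assumption: empty tokens are dropped
        # CapitalizationFilterFactory with onlyFirstWord="false"
        token = " ".join(word.capitalize() for word in token.split(" "))
        terms.append(token)
    return terms

terms = pretty_terms(" books and magazines; computers, ")
print(terms)
```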

raw QParser

fq = {!raw f=category}Books & Magazines

This is utilizing Solr's LocalParams Syntax to embed metadata directly into the fq param value. {!raw} is the short form of {!type=raw}

You could also alter the default parser, but it's unlikely you would want all of your query params parsed with the raw parser by default.

One potential pitfall with using the raw QParser is if you facet on Numeric fields that utilize an encoded representation (ie: the "TrieFoo" or "SortableFoo" Field Types). The RawQParser expects truly "Raw" Terms, but for encoded numeric types the term you get in the facet response is the "external" value, and RawQParser won't convert that to the internal value. The field QParser may be a better choice in those situations (it's the one Yonik recommends) -- However if you have a Query Analyzer that is not idempotent (a situation that's easy to get into without realizing it) it's very possible to get incorrect results. The term QParser discussed later will be the best of both worlds.
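Turning a Constraint from the facet response into a filter is then just string building. A minimal sketch in Python, where raw_filter is a hypothetical helper in the front end application; it also shows why the value has to be URL escaped before it goes on the wire, since a bare "&" would otherwise end the param:

```python
from urllib.parse import urlencode

def raw_filter(field, term):
    """Build an fq value that matches one exact indexed term via {!raw}."""
    return "{!raw f=%s}%s" % (field, term)

fq = raw_filter("category", "Books & Magazines")
print(fq)

# Escaped form for the request URL; the "&" inside the term is encoded.
escaped = urlencode([("fq", fq)])
print(escaped)
```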

Multi-Select Facets: Example

Screenshot Source: http://search.lucidimagination.com/search/?q=Range+Facets+#/p:solr

Clicking a check box for a Constraint in one Facet (like Projects) causes the result set to change, and affects the counts for Constraints in other Facets (like Source) but it does not affect the counts for other Constraints within the same Facet.

This allows the user to select multiple check boxes for the same Facet, getting a result set which contains the Union of those Constraints.

Multi-Select Facets

q = Range Facets
fq = {!df=project tag=px}Solr
facet.field = {!ex=px}project
facet.field = {!ex=sx}source
facet.query = {!ex=qx}popularity:[100 TO *]

This is another example of using the LocalParams Syntax, but we aren't overriding the QParser used. In the case of the facet.field params no QParser is used at all.
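The effect of tag and ex can be illustrated without Solr. A small in-memory sketch in Python; the documents, the filter tuples, and the facet_counts helper are all hypothetical, and only the idea of "ignore the filters whose tag is excluded when counting this Facet" comes from the params above:

```python
from collections import Counter

# Hypothetical documents; field names follow the slide.
DOCS = [
    {"id": 1, "project": "Solr",   "source": "wiki"},
    {"id": 2, "project": "Solr",   "source": "email"},
    {"id": 3, "project": "Lucene", "source": "wiki"},
    {"id": 4, "project": "Nutch",  "source": "email"},
]

# Each filter is (tag, field, accepted values), like fq={!df=project tag=px}Solr.
FILTERS = [("px", "project", {"Solr"})]

def facet_counts(docs, filters, field, exclude_tags=()):
    """Count values of `field`, ignoring any filter whose tag is excluded."""
    active = [f for f in filters if f[0] not in exclude_tags]
    matching = [d for d in docs
                if all(d[fld] in allowed for _, fld, allowed in active)]
    return Counter(d[field] for d in matching)

# Like facet.field={!ex=px}project: counts are unaffected by the project filter.
project_counts = dict(facet_counts(DOCS, FILTERS, "project", exclude_tags=("px",)))
# Like a plain facet.field=source: counts reflect the project filter.
source_counts = dict(facet_counts(DOCS, FILTERS, "source"))
print(project_counts)
print(source_counts)
```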

Same Facet, Different Exclusions

q = Hot Rod
fq = {!df=colors tag=cx}purple green
facet.field = {!key=all_colors ex=cx}colors
facet.field = {!key=overlap_colors}colors

In this example, we are assuming that the "colors" field is multivalued, and each doc can have many colors listed.

As of Solr 1.4.1, specifying a key for a facet does not allow you to specify alternate params using the "per-field" syntax.

Same Facet, Different Exclusions

<lst name="facet_fields">
  <lst name="all_colors">
    <int name="red">19</int>
    <int name="green">6</int>
    <int name="blue">2</int>
    ...
  <lst name="overlap_colors">
    <int name="red">7</int>
    <int name="green">6</int>
    <int name="blue">1</int>
    ...

facet.query Labels

facet.query = {!label="Hot!"} +pop:[1 TO *] +pub_date:[NOW/DAY-1DAY TO *]
<lst name="facet_queries">
  <int name="{!label=&quot;Hot!&quot;}...">15</int>
</lst>
...

You can also use {!key} with facet.query, but then the query string is not included in the response, and your presentation layer won't know what to specify in an fq to filter on that constraint.

Taxonomy Facets

Note that the dotted lines represent the possibility of many more nodes in the Taxonomy, and that documents can be associated with any nodes (not just leaf nodes). We also have some documents associated with more than one node (History of Physics).

Taxonomy Facets: Data

Abbreviations are being used for the Indexed Terms instead of the full category names to save space in the slides.

If every document in our "corpus" was part of the Taxonomy, and "NonFic" is the root node, then we wouldn't need to index a term for it.

If our taxonomy extends up (ie: NonFic has a parent node) or if some documents are not in the Taxonomy at all, then it definitely matters.

Taxonomy Facets: Initial Query

facet.field = category
facet.prefix = 1/NonFic
facet.mincount = 1
<result numFound="164" ...
<lst name="facet_fields">
  <lst name="category">
    <int name="1/NonFic/Sci">2</int>
    <int name="1/NonFic/Hist">1</int>
    <int name="1/NonFic/Law">1</int>

Note that some docs may not even be mapped to the Taxonomy, so they aren't included in the counts. In these examples, we're assuming lots of docs in the index, but only the 3 previously mentioned documents are categorized in the Taxonomy.

Taxonomy Facets: Drill Down

fq = {!raw f=category}1/NonFic/Sci
facet.field = category
facet.prefix = 2/NonFic/Sci
facet.mincount = 1
<result numFound="2" ...
<lst name="facet_fields">
  <lst name="category">
    <int name="2/NonFic/Sci/Phys">1</int>

In this example, the nodes were encoded into Terms in such a way that as we drill down the Taxonomy (using facet.prefix) we only ever see the immediate children of the "current" node. We can eliminate the usage of facet.prefix to always see the "active branches" of the Taxonomy, or with some added creativity and redundancy in the Terms we index, we can also make it always show N levels below the current node.
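The Term encoding used in these examples can be generated at index time with a few lines of code. A minimal sketch in Python, assuming each Term is the node's depth followed by its full path (the "0/NonFic" root Term and both helper names are illustrative assumptions; as noted earlier, the root Term may not be needed at all):

```python
def taxonomy_terms(path):
    """Encode a node path as one depth-prefixed term per ancestor level."""
    parts = path.split("/")
    return ["%d/%s" % (depth, "/".join(parts[:depth + 1]))
            for depth in range(len(parts))]

def child_prefix(selected_term):
    """facet.prefix that lists the immediate children of a selected term."""
    depth, _, node_path = selected_term.partition("/")
    return "%d/%s" % (int(depth) + 1, node_path)

# Terms to index for a document filed under NonFic > Sci > Phys.
terms = taxonomy_terms("NonFic/Sci/Phys")
print(terms)

# After the user picks "1/NonFic/Sci", this is the next facet.prefix.
prefix = child_prefix("1/NonFic/Sci")
print(prefix)
```

Indexing one Term per ancestor is what lets a single fq on any level match every document below it.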

Faceting Coarse Date Fields

I would typically recommend a "+1MILLISECOND" fudge factor when indexing your data because it keeps the day, month, and year accurate for all date values, and keeps the faceting simple. Alternatively a "-1MILLISECOND" fudge in the Constraints can also work in some cases, but it changes the "day" (and can have cascading changes to the month and year) so it can confuse people if you just do simple formatting/truncation of the output values. It can also be problematic for "Month Faceting" (2010-01-31T23:59:59.999Z +1MONTH = 2010-02-28T23:59:59.999Z +1MONTH = 2010-03-28...)
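The "Month Faceting" drift is easy to reproduce. A small sketch in Python, assuming that adding a month clamps the day to the last valid day of the target month (which is what the arithmetic in the note implies); add_month is a hypothetical helper, not Solr's date math implementation:

```python
import calendar
from datetime import datetime

def add_month(ts):
    """Add one calendar month, clamping the day to the target month's length."""
    year = ts.year + (ts.month // 12)
    month = ts.month % 12 + 1
    day = min(ts.day, calendar.monthrange(year, month)[1])
    return ts.replace(year=year, month=month, day=day)

# Boundary with a "-1MILLISECOND" fudge: the last instant of January.
fudged = datetime(2010, 1, 31, 23, 59, 59, 999000)
feb = add_month(fudged)   # clamped to Feb 28
mar = add_month(feb)      # stays on the 28th, no longer the end of the month

# Boundary without the fudge: the first instant of February.
clean = add_month(datetime(2010, 2, 1))
print(feb, mar, clean)
```

Once the boundary has been clamped to the 28th it never returns to the end of the month, while boundaries on the first of the month stay aligned.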

The new facet.date.include param (discussed later in this presentation) eliminates the problem of overlapping ranges in the facet results, but there is still the issue of constructing a query to apply the corresponding constraint as a filter.

SOLR-355 adds functionality to the Lucene QueryParser to allow mixed usage of '{' and ']' in range queries to make this possible. It has been committed to the trunk, so it should certainly be included in Solr 4.0, and it may be back ported and included in Solr 3.1 as well. SOLR-1896 also aims to simplify this in a way that will be trivial to use in conjunction with facet.date.include, but no patch is currently available.
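With mixed brackets available, the filter for a single date Constraint can include its lower bound and exclude its upper bound. A minimal sketch in Python; range_filter is a hypothetical front end helper, and the resulting string only parses on a Solr version that includes the SOLR-355 change:

```python
def range_filter(field, low, high, include_low=True, include_high=False):
    """Build a range query string with independently chosen bracket styles."""
    open_bracket = "[" if include_low else "{"
    close_bracket = "]" if include_high else "}"
    return "%s:%s%s TO %s%s" % (field, open_bracket, low, high, close_bracket)

# One month-wide Constraint: January inclusive, February 1st exclusive.
fq = range_filter("pubdate", "2009-01-01T00:00:00Z", "2009-02-01T00:00:00Z")
print(fq)
```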

Performance Gotchas

facet.method=...

Lucene (and Solr) are "Inverted Indexes" - given a Set of "Docs->Fields" mappings, the Fields are parsed into Terms and a "Terms->Docs" data structure is used for fast lookups. fieldCache and fieldValueCache are essentially "Inverted, Inverted Indexes" (or to put it another way: "UnInverted Indexes") supporting fast Doc->Term(s) lookups.

An example when facet.method=enum is a good choice is something like a us_state field. No matter how many documents you have in your index, or how many documents match a given search request, the number of distinct Terms (and the number of set intersections) will never be more than 50.

Another example when facet.method=enum is a good choice is when you run out of RAM trying to build the fieldValueCache. (Common when trying to facet on a Full-Text field)
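The two ways of counting can be sketched on a toy index. This is a conceptual illustration in Python, not Solr's implementation; the documents and both function names are hypothetical. One function walks every Term and intersects its doc set with the result set (the enum idea), the other walks the matching docs through a Doc->Terms structure (the "UnInverted" idea):

```python
from collections import Counter

# Hypothetical docs: doc id -> terms in a us_state-like field.
DOC_TERMS = {1: ["CA"], 2: ["NY"], 3: ["CA"], 4: ["TX"], 5: ["NY", "CA"]}

# Inverted index: term -> set of doc ids.
INVERTED = {}
for doc_id, terms in DOC_TERMS.items():
    for term in terms:
        INVERTED.setdefault(term, set()).add(doc_id)

def facet_enum(result_docs):
    """Walk every term and intersect its doc set with the result set."""
    counts = {term: len(docs & result_docs) for term, docs in INVERTED.items()}
    return {t: c for t, c in counts.items() if c}

def facet_uninverted(result_docs):
    """Walk every matching doc and tally its terms (doc -> terms lookup)."""
    counts = Counter(t for d in result_docs for t in DOC_TERMS[d])
    return dict(counts)

results = {1, 3, 5}
enum_counts = facet_enum(results)
uninverted_counts = facet_uninverted(results)
print(enum_counts, uninverted_counts)
```

Both produce the same counts; the cost of the first grows with the number of distinct Terms, the cost of the second with the number of matching docs and the memory needed for the Doc->Terms structure.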

facet.enum.cache.minDf=...

facet.method=enum and facet.enum.cache.minDf=100 (or higher) is basically the only feasible way to Facet on a Full-Text field.

Cache Settings

<filterCache class="solr.FastLRUCache" size="1024" initialSize="1024" autowarmCount="1024"/>

Sizing caches appropriately depends on how many different fields you expect to Facet on, how many distinct Facet Queries, etc. Step one towards picking good numbers is looking at your caching stats before and after typical queries. If a single request, on an isolated dev instance that you just started up, causes any evictions: your cache is too small. Either increase the size, or get rid of it; it's just going to slow you down.

autowarmCount controls how many values are pre-computed for each new cache instance whenever the old one needs to be thrown out (ie: a new index is loaded). The items to pre-compute are selected from the "top" keys of the existing cache. Using autowarming for fieldValueCache is questionable because the objects are so big -- one stray request to Facet on an atypical field could eat up a lot of memory until you completely shut down Solr.

Static Warming

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
      ...

Static warming queries should simulate real queries as much as possible, but there is no need for a lot of overlap. The key is to ensure that every field you expect to Facet on is hit, so that the caches used are properly seeded.

newSearcher events automatically trigger "auto warming" for caches that configure it, but fieldCache doesn't currently support auto warming, and as mentioned previously there are performance reasons why auto warming fieldValueCache may not be wise. (Those same reasons would apply to fieldCache even if it did support auto warming.)

Coming Soon(ish)

term QParser

fq = {!term f=category}Books & Magazines
fq = {!term f=weight}1.56

SOLR-2113 Tracks this functionality. It has already been committed to the trunk, so it should certainly be included in Solr 4.0, and it will likely be back ported and included in Solr 3.1 as well.

facet.range=...

Same semantics as facet.date, but also works on any numeric field type (that supports range queries)

facet.range = popularity
facet.range.start = 0.0
facet.range.end = 100.0
facet.range.gap = 33.333334

This functionality will be included in Solr 3.1.

FYI: Because of this more generalized functionality, facet.date will likely be deprecated in future releases.
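The Constraints implied by start, end, and gap can be worked out by hand. A minimal sketch in Python of that arithmetic, assuming buckets begin at start and are added while their lower bound is below end; range_buckets is a hypothetical helper, and how the upper bound of the final bucket is treated when the gap does not divide the range evenly is not covered here:

```python
def range_buckets(start, end, gap):
    """Return the lower bound of each bucket from start up to (not past) end."""
    lows = []
    low = start
    while low < end:
        lows.append(low)
        low += gap
    return lows

# The params above: three buckets starting at 0.0.
lows = range_buckets(0.0, 100.0, 33.333334)
print(lows)
```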

facet.range.include=...

New multivalued param for controlling which end points of the Constraint ranges are inclusive.

This functionality will be included in Solr 3.1.

FYI: facet.date.include has also been added.

Pivot Facets

<lst name="cat,inStock">
  <lst name="electronics">
    <int name="true">10</int>
    <int name="false">4</int>
  </lst>
  <lst name="memory">
    <int name="true">3</int>
    <int name="false">0</int>
  </lst>
  ...

This feature is being tracked in SOLR-792. It has already been committed to the trunk, so it should certainly be included in Solr 4.0. It may be back ported and included in Solr 3.1 as well, depending on demand and complexity in porting.
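Conceptually a pivot is just one Facet counted separately inside each Constraint of another. A small in-memory sketch in Python of that idea; the documents and the pivot_counts helper are hypothetical and say nothing about how the SOLR-792 patch computes or formats its output:

```python
from collections import Counter, defaultdict

# Hypothetical documents with the two fields from the slide.
DOCS = [
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": False},
    {"cat": "electronics", "inStock": True},
    {"cat": "memory", "inStock": True},
]

def pivot_counts(docs, outer, inner):
    """Count values of `inner` separately for each value of `outer`."""
    pivot = defaultdict(Counter)
    for doc in docs:
        pivot[doc[outer]][doc[inner]] += 1
    return {key: dict(counter) for key, counter in pivot.items()}

pivot = pivot_counts(DOCS, "cat", "inStock")
print(pivot)
```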

Native Taxonomy Facets

This feature is being tracked in SOLR-64. A patch is available but it has not yet been committed, so it is not certain what release it may be in.

I find the output format currently used by the patch to be too confusing to even show on this slide.

Questions?