Why Are We Here?
- What is Solr?
- What is Faceted Searching?
- Solr Facets
- Tips & Tricks
- Performance Gotchas
- Coming Soon(ish)
What is Solr?
Elevator Pitch
Solr is a highly scalable open source enterprise search
server based on the Lucene Java search library, with HTTP APIs,
caching, replication, and a web administration interface.
Solr In A Nutshell
- Index/Query Via HTTP
- Comprehensive HTML Administration Interfaces
- Scalability - Horizontal and Vertical
- Extensible Plugin Architecture
- Highly Configurable And User Extensible Caching
- Flexible And Adaptable With XML Configuration
What is Faceted Searching?
Example: Lucid Imagination
Aka: “Faceted Browsing”
Interaction style where users filter a set of items by progressively
selecting from only valid values of a faceted classification system.
Keith Instone, SOASIS&T, July 8, 2004
Key Elements of Faceted Search
- No hierarchy of options is enforced
- Users can apply facet constraints in any order
- Users can remove facet constraints in any order
- No surprises
- The user is only given facets and constraints
that make sense in the context of the items
they are looking at
- The user always knows what to expect before
they apply a constraint
Explaining My Terms
- Facet: A distinct feature or aspect of a
set of objects; "a way in which a
resource can be classified"
- Constraint: A viable method of limiting a
set of objects
http://facetmap.com/glossary/
is the source of the quote in my Facet definition.
They have a different term for "Constraint" which I don't like
as much.
Solr Facets
Preliminaries: Solr Params
- Request Based:
?q=solr&fq=inStock:true&hl=true
- solrconfig.xml Defaults & Overrides:
<lst name="defaults">
<int name="rows">50</int>
</lst>
<lst name="appends">
<str name="fq">access:public</str>
</lst>
<lst name="invariants">
<bool name="hl">false</bool>
</lst>
For legibility, all examples of SolrParams in the remainder of
this presentation will be non-url escaped, with one param per line.
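As a rough sketch of how the "one param per line" notation used in these slides maps back onto a real, URL-escaped request, the snippet below builds a query string from a list of params. The "/select" path and the build_query_string helper are assumptions made for illustration; only the param names and values come from the slides.

```python
from urllib.parse import urlencode

# One (name, value) pair per param, in the unescaped form used on the slides.
# Repeated names (such as facet.query) are simply listed more than once.
params = [
    ("q", "solr"),
    ("fq", "inStock:true"),
    ("facet", "true"),
    ("facet.query", "rank:[* TO 20]"),
    ("facet.query", "rank:[21 TO *]"),
]


def build_query_string(pairs):
    """URL-escape a list of (name, value) pairs into a query string."""
    return urlencode(pairs)


# "/select" is an assumed request handler path, not taken from the slides.
print("/select?" + build_query_string(params))
```

Keeping the params as a list of pairs (rather than a dict) matters here because the same param name can legitimately appear several times.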
facet=true
- Enables (or disables) all faceting support
- Allows you to provide default faceting params in configs,
while keeping faceting off by default
facet.query=...
- Specifies a query string to be used as a Facet Constraint
- Typically used multiple times to get multiple (discrete) sets
facet.query = rank:[* TO 20]
facet.query = rank:[21 TO *]
facet.query Results
<result numFound="27" ... />
...
<lst name="facet_counts">
<lst name="facet_queries">
<int name="rank:[* TO 20]">2</int>
<int name="rank:[21 TO *]">15</int>
</lst>
...
facet.field=...
- Specifies a Field to be used as a Facet
- Uses each value indexed in that Field as a Constraint
- Most useful when used on fields with a low cardinality
relative to the number of documents in the index
- Can be used multiple times for multiple fields
facet.field = color
facet.field = category
facet.field Results
<lst name="facet_counts">
...
<lst name="facet_fields">
<lst name="color">
<int name="red">17</int>
<int name="green">6</int>
<int name="blue">2</int>
<int name="yellow">2</int>
<int name="fuscia">0</int>
<int name="teal">0</int>
...
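A presentation layer usually needs these counts as plain data. The following is a minimal sketch of pulling the Constraint counts for one field out of an XML response shaped like the one above. The SAMPLE document is a trimmed, hypothetical response (the enclosing response element and the closing tags are assumed, since the slide elides them), and facet_field_counts is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed response modelled on the slide's facet.field output.
SAMPLE = """
<response>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="color">
        <int name="red">17</int>
        <int name="green">6</int>
        <int name="fuscia">0</int>
      </lst>
    </lst>
  </lst>
</response>
"""


def facet_field_counts(xml_text, field):
    """Return [(constraint, count), ...] for one facet field, in response order."""
    root = ET.fromstring(xml_text.strip())
    path = ".//lst[@name='facet_fields']/lst[@name='%s']/int" % field
    return [(el.get("name"), int(el.text)) for el in root.findall(path)]


print(facet_field_counts(SAMPLE, "color"))
```

A list of pairs is returned instead of a dict so that the ordering chosen by Solr is preserved for display.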
facet.field Options
- facet.prefix - Restricts the
possible constraints to only indexed values with a specified
prefix.
- facet.mincount=0 - Restricts the
constraints returned to those containing a minimum number of
documents in the result set.
- facet.sort=count - The ordering of
constraints (count vs
index)
- facet.offset=0 - Indicates how many
constraints in the specified sort ordering should be skipped in
the response.
- facet.limit=100 - The number of
constraints in the specified sorted order that should be
returned starting at the specified offset.
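To make the interaction of these options concrete, here is a small pure-Python model of how they narrow down the list of Constraints. It is only a sketch of the behaviour described above, not Solr's implementation; the select_constraints name is made up, and breaking count ties by term order is an assumption.

```python
def select_constraints(term_counts, prefix="", mincount=0,
                       sort="count", offset=0, limit=100):
    """Model facet.prefix / mincount / sort / offset / limit on a dict of counts."""
    # facet.prefix and facet.mincount decide which Constraints are eligible.
    items = [(term, count) for term, count in term_counts.items()
             if term.startswith(prefix) and count >= mincount]
    if sort == "count":
        # Highest count first; ties broken by term order (an assumption).
        items.sort(key=lambda tc: (-tc[1], tc[0]))
    else:
        # "index" ordering: sorted by the term itself.
        items.sort(key=lambda tc: tc[0])
    # facet.offset skips Constraints, facet.limit caps how many are returned.
    return items[offset:offset + limit]


colors = {"red": 17, "green": 6, "blue": 2, "yellow": 2, "fuscia": 0, "teal": 0}
print(select_constraints(colors, mincount=1))
print(select_constraints(colors, sort="index", offset=1, limit=2))
```

Paging through a long list of Constraints is then a matter of keeping the sort fixed and advancing offset by limit on each request.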
facet.date=...
- Specifies a Date Field to be used as a Facet
- Creates Constraints based on evenly sized date ranges using
the Gregorian Calendar
- Ranges are specified using "Date Math" so they DWIM ("Do What I Mean")
in spite of variable length months and leap years
facet.date = pubdate
facet.date.start = NOW/YEAR-1YEAR
facet.date.end = NOW/MONTH+1MONTH
facet.date.gap = +1MONTH
facet.date Results
<lst name="facet_counts">
...
<lst name="facet_dates">
<lst name="pubdate">
<int name="2009-01-01T00:00:00Z">4</int>
<int name="2009-02-01T00:00:00Z">6</int>
<int name="2009-03-01T00:00:00Z">0</int>
<int name="2009-04-01T00:00:00Z">13</int>
...
<int name="gap">+1MONTH</int>
<date name="end">2010-12-01T00:00:00Z</date>
facet.date Options
- facet.date.hardend=false -
Determines what effective end date is used when the specified
"start" and "end" don't divide into even "gap" sized
buckets; false means the last
Constraint range may be shorter than the others
- facet.date.other=none - Allows you to
specify what other Constraints you are interested in besides the
generated ranges: before, after, between, none, all
Tips & Tricks
"Pretty" facet.field Terms
- Field Faceting uses Indexed Terms
- Leverage copyField and TokenFilters that will give you
good looking Constraints
<tokenizer class="solr.PatternTokenizerFactory"
pattern="(,|;)\s*" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="\s+" replacement=" " />
<filter class="solr.PatternReplaceFilterFactory"
pattern=" and " replacement=" & " />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.CapitalizationFilterFactory"
onlyFirstWord="false" />
If you can use unique identifiers (instead of pretty, long,
string labels) it can reduce memory usage and simplify the
request parsing (see next Tip), but it adds work to your front end
application -- keeping track of the mapping between id =>
pretty label. If your application already needs to know
about these mappings for other purposes, then it's much
simpler to take advantage of that.
"Pretty" facet.field Results
<field name="category"> books
and magazines;
computers, </field>
<lst name="category">
<int name="Books & Magazines">1</int>
<int name="Computers">1</int>
</lst>
raw QParser
- Default Query Parser does special things with whitespace and punctuation
- Problematic when "filtering" on Facet Field Constraints that
contain whitespace or punctuation
- Use the raw parser to filter on an exact Term
fq = {!raw f=category}Books & Magazines
This is utilizing Solr's LocalParams Syntax to embed metadata
directly into the fq param value. {!raw} is the short form of
{!type=raw}
You could also alter the default parser, but it's unlikely
you would want all of your query params parsed with the
raw parser by default.
One potential pitfall with using the raw QParser is if you facet on
Numeric fields that utilize an encoded representation (ie:
the "TrieFoo" or "SortableFoo" Field Types). The RawQParser
expects truly "Raw" Terms, but for encoded numeric types the
term you get in the facet response is the "external" value,
and RawQParser won't convert that to the internal value. The
field QParser may be a better choice
in those situations (it's the one Yonik recommends) -- However if
you have a Query Analyzer that is not idempotent
(a situation that's easy to get into w/o realizing it) it's very
possible to get incorrect results. The term QParser discussed
later will be the best of both worlds.
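For a front end, the practical upshot is that the Constraint label from the facet response can be passed through verbatim, as long as the whole fq value is URL-escaped when the request is built. The sketch below illustrates this; raw_filter is a hypothetical helper name, and the code assumes the field name itself contains no whitespace or "}" characters.

```python
from urllib.parse import quote


def raw_filter(field, term):
    """Build an fq value that matches one exact indexed Term via the raw QParser."""
    # Assumes `field` is a simple field name with no spaces or braces.
    return "{!raw f=%s}%s" % (field, term)


fq = raw_filter("category", "Books & Magazines")
print(fq)

# The "&" must be escaped, otherwise it would be read as a param separator.
print("fq=" + quote(fq, safe=""))
```

Nothing inside the Term itself needs query-syntax escaping here, which is the main attraction of the raw parser for this use case.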
Multi-Select Facets: Example
Screenshot Source: http://search.lucidimagination.com/search/?q=Range+Facets+#/p:solr
Clicking a check box for a Constraint in one Facet (like
Projects) causes the result set to change, and affects the
counts for Constraints in other Facets (like Source) but it
does not affect the counts for other Constraints within the
same Facet.
This allows the user to select multiple check boxes for the
same Facet, getting a result set which contains the Union of
those Constraints.
Multi-Select Facets
- We can "tag" a filter query with an identifier
- We can instruct a facet to "exclude" that filter query when
computing Constraint counts.
q = Range Facets
fq = {!df=project tag=px}Solr
facet.field = {!ex=px}project
facet.field = {!ex=sx}source
facet.query = {!ex=qx}popularity:[100 TO *]
This is another example of using the LocalParams
Syntax, but we aren't overriding the QParser used. In the
case of the facet.field params no
QParser is used at all.
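One way an application might generate these params is sketched below. The multi_select_params helper and the "<field>_tag" naming scheme are assumptions made for this example (the slide uses short tags such as px); the {!df=... tag=...} and {!ex=...} forms are the ones shown above. As on the slide, every facet excludes its own tag even when no filter currently carries that tag.

```python
def multi_select_params(query, facet_fields, selected):
    """Build params for multi-select faceting.

    facet_fields: field names to facet on.
    selected: dict mapping a field name to the value the user checked.
    """
    params = [("q", query), ("facet", "true")]
    for field in facet_fields:
        tag = field + "_tag"  # hypothetical tag naming scheme
        if field in selected:
            # Tag the filter so the matching facet can ignore it.
            params.append(("fq", "{!df=%s tag=%s}%s" % (field, tag, selected[field])))
        # Exclude this field's own filter when counting its Constraints.
        params.append(("facet.field", "{!ex=%s}%s" % (tag, field)))
    return params


for name, value in multi_select_params("Range Facets", ["project", "source"],
                                       {"project": "Solr"}):
    print(name, "=", value)
```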
Same Facet, Different Exclusions
- A key can be specified for a facet
to change the name used to identify it in the response.
- This allows you to have multiple instances of a facet, with
different exclusions.
q = Hot Rod
fq = {!df=colors tag=cx}purple green
facet.field = {!key=all_colors ex=cx}colors
facet.field = {!key=overlap_colors}colors
In this example, we are assuming that the "colors" field is
multivalued, and each doc can have many colors listed.
As of Solr 1.4.1, specifying a key for a facet
does not allow you to specify alternate
params using the "per-field" syntax
Same Facet, Different Exclusions
<lst name="facet_fields">
<lst name="all_colors">
<int name="red">19</int>
<int name="green">6</int>
<int name="blue">2</int>
...
<lst name="overlap_colors">
<int name="red">7</int>
<int name="green">6</int>
<int name="blue">1</int>
...
facet.query Labels
- facet.query params are echoed
verbatim when returning the constraint counts
- When declaring a facet.query in your solrconfig.xml, you can
include a "label" that your presentation layer can parse out for display
facet.query = {!label="Hot!"}
+pop:[1 TO *]
+pub_date:[NOW/DAY-1DAY TO *]
<lst name="facet_queries">
<int name="{!label="Hot!"}...">15</int>
</lst>
...
You can also use {!key} with facet.query, but then the query
string is not included in the response, and your presentation
layer won't know what to specify in an fq to filter on that constraint.
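The "parse out for display" step could look something like the sketch below. It assumes the echoed facet.query key starts with a single {!label="..."} local param, as in the example above; split_label and LABEL_RE are illustrative names, and the regular expression is deliberately simple rather than a full LocalParams parser.

```python
import re

# Matches a leading {!label="..."} block followed by the actual query string.
LABEL_RE = re.compile(r'^\{!label="([^"]*)"\}\s*(.*)$', re.DOTALL)


def split_label(facet_query):
    """Return (label, query) for an echoed facet.query key; label may be None."""
    match = LABEL_RE.match(facet_query)
    if match is None:
        return None, facet_query
    return match.group(1), match.group(2)


key = '{!label="Hot!"}+pop:[1 TO *] +pub_date:[NOW/DAY-1DAY TO *]'
label, query = split_label(key)
print(label)   # text to show the user
print(query)   # query string usable when filtering on this constraint
```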
Taxonomy Facets
- What If Your Documents Are Organized in a Taxonomy?
- What if you want to treat that Taxonomy as a Facet?
Note that the dotted lines represent the possibility of many
more nodes in the Taxonomy, and that documents can be
associated with any nodes (not just leaf nodes). We also have
some documents associated with more than one node (History of
Physics).
Taxonomy Facets: Data
- Flattened Data
Doc#1: NonFic > Law
Doc#2: NonFic > Sci
Doc#3: NonFic > Hist
Doc#3: NonFic > Sci > Phys
- Indexed Terms
Doc#1: 0/NonFic, 1/NonFic/Law
Doc#2: 0/NonFic, 1/NonFic/Sci
Doc#3: 0/NonFic, 1/NonFic/Hist,
1/NonFic/Sci, 2/NonFic/Sci/Phys
Abbreviations are being used for the Indexed Terms instead of
the full category names to save space in the slides.
If every document in our "corpus" was part of the Taxonomy,
and "NonFic" is the root node, then we wouldn't need to
index a term for it.
If our taxonomy extends up (ie: NonFic has a
parent node) or if some documents are not in the Taxonomy at all,
then indexing that term definitely matters.
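The encoding of Flattened Data into Indexed Terms shown above can be sketched in a few lines of indexing-side code. The function names are hypothetical; the "depth/path" term format is the one used on the slide.

```python
def taxonomy_terms(path):
    """Turn one taxonomy path into depth-prefixed terms, one per ancestor level."""
    return ["%d/%s" % (depth, "/".join(path[:depth + 1]))
            for depth in range(len(path))]


def terms_for_doc(paths):
    """Collect the distinct terms for a doc attached to one or more nodes."""
    terms = []
    for path in paths:
        for term in taxonomy_terms(path):
            if term not in terms:  # shared ancestors are only indexed once
                terms.append(term)
    return terms


# Doc#3 from the slide: attached to both "Hist" and "Sci > Phys".
print(terms_for_doc([["NonFic", "Hist"], ["NonFic", "Sci", "Phys"]]))
```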
Taxonomy Facets: Initial Query
facet.field = category
facet.prefix = 1/NonFic
facet.mincount = 1
<result numFound="164" ...
<lst name="facet_fields">
<lst name="category">
<int name="1/NonFic/Sci">2</int>
<int name="1/NonFic/Hist">1</int>
<int name="1/NonFic/Law">1</int>
Note that some docs may not even be mapped to the Taxonomy, so
they aren't included in the counts. In these examples, we're
assuming lots of docs in the index, but only the 3 previously
mentioned documents are categorized in the Taxonomy.
Taxonomy Facets: Drill Down
fq = {!raw f=category}1/NonFic/Sci
facet.field = category
facet.prefix = 2/NonFic/Sci
facet.mincount = 1
<result numFound="2" ...
<lst name="facet_fields">
<lst name="category">
<int name="2/NonFic/Sci/Phys">1</int>
In this example, the nodes were encoded into Terms in such a
way that as we drill down the Taxonomy (using
facet.prefix) we only ever see the
immediate children of the "current" node. We can eliminate
the usage of facet.prefix to always
see the "active branches" of the Taxonomy, or with some added
creativity and redundancy in the Terms we index, we can also
make it always show N levels below the current node.
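Given the Term a user clicked on, the params for the next drill-down request can be derived mechanically. This is a minimal sketch under the "depth/path" encoding used above; drill_down_params is a hypothetical helper, and it reproduces the params from the Drill Down slide.

```python
def drill_down_params(field, selected_term):
    """Build params to filter on a taxonomy Term and facet on its children."""
    # "1/NonFic/Sci" -> depth "1", path "NonFic/Sci"
    depth, _, path = selected_term.partition("/")
    child_prefix = "%d/%s" % (int(depth) + 1, path)
    return [
        ("fq", "{!raw f=%s}%s" % (field, selected_term)),
        ("facet.field", field),
        ("facet.prefix", child_prefix),
        ("facet.mincount", "1"),
    ]


for name, value in drill_down_params("category", "1/NonFic/Sci"):
    print(name, "=", value)
```

If sibling nodes could share a name prefix (a hypothetical "Sci" next to "SciFi", say), appending a trailing "/" to the prefix would be one way to keep them apart.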
Faceting Coarse Date Fields
- Situation: Dates are indexed with coarse granularity
- Annoyance: Date Faceting Constraints are inclusive of both
range boundaries
- Problem: Many documents have values that fall on exact
boundaries, and are included in multiple constraints
- Solution: Fudge your field values (or your Constraints)
by 1 millisecond
I would typically recommend a "+1MILLISECOND" fudge factor
when indexing your data because it keeps the day, month, and
year accurate for all date values, and keeps the faceting
simple.
Alternatively a "-1MILLISECOND" fudge in the Constraints can
also work in some cases, but it changes the "day" (and can
have cascading changes to the month and year) so it can
confuse people if you just do simple formatting/truncation of
the output values. It can also be problematic for "Month
Faceting" (2010-01-31T23:59:59.999Z +1MONTH =
2010-02-28T23:59:59.999Z +1MONTH = 2010-03-28...)
The new facet.date.include param
(discussed later in this presentation) eliminates the problem
of overlapping ranges in the facet results, but there is still
the issue of constructing a query to
apply the corresponding constraint as a filter.
SOLR-355
adds functionality to the Lucene QueryParser to allow mixed
usage of '{' and ']' in range queries to make this possible.
It has been committed to the trunk, so it should certainly be
included in Solr 4.0, and it may be back ported and included in
Solr 3.1 as well.
SOLR-1896
also aims to simplify this in a way that will be trivial to use in
conjunction with facet.date.include,
but no patch is currently available.
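The double counting, and the effect of the "+1MILLISECOND" fudge at index time, can be demonstrated with a toy model. The dates and the count_inclusive helper below are made up for illustration; the only behaviour taken from the slide is that both range boundaries are inclusive.

```python
from datetime import datetime, timedelta


def count_inclusive(values, start, end):
    """Count values in [start, end], inclusive of both boundaries."""
    return sum(1 for v in values if start <= v <= end)


jan, feb, mar = datetime(2010, 1, 1), datetime(2010, 2, 1), datetime(2010, 3, 1)

# Coarse dates: every value sits exactly on a range boundary.
docs = [jan, feb, feb, mar]

# Without a fudge, the two Feb 1 docs are counted in both ranges.
print(count_inclusive(docs, jan, feb), count_inclusive(docs, feb, mar))  # 3 3

# Shifting each value by 1 millisecond puts it in exactly one range.
fudged = [d + timedelta(milliseconds=1) for d in docs]
print(count_inclusive(fudged, jan, feb), count_inclusive(fudged, feb, mar))  # 1 2
```

After the shift, the Mar 1 doc falls just inside the following range (not shown), so no doc is counted twice.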
Performance Gotchas
facet.method=...
- facet.field can use two different algorithms depending on
the facet.method param:
- fc: Iterates over the result
documents, incrementing a count for each associated Term.
This uses the fieldCache (same as sort) for single valued
fields, or the fieldValueCache for multi-valued fields.
- enum: Iterates over all the
Terms in the field, computing counts using set
intersections with the result documents. This uses the
filterCache (same as fq)
- fc is the default method for non-Boolean Fields
- enum can be a better method
when the number of distinct Terms is going to remain small and
fixed no matter how many documents you index
Lucene (and Solr) are "Inverted Indexes" - given a Set of
"Docs->Fields" mappings, the Fields are parsed into
Terms and a "Terms->Docs" data structure is used for fast
lookups.
fieldCache and
fieldValueCache are essentially
"Inverted, Inverted Indexes" (or to put it another way:
"UnInverted Indexes") supporting fast Doc->Term(s) lookups.
An example when facet.method=enum
is a good choice is something like
a us_state field. No matter how
many documents you have in your index, or how many documents
match a given search request, the number of distinct Terms
(and the number of set intersections) will never be more than
50.
Another example when facet.method=enum
is a good choice is when you run out of RAM trying to build
the fieldValueCache. (Common when
trying to facet on a Full-Text field)
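The difference between the two algorithms is easier to see in a toy model. The sketch below is not how Solr implements either method; it only mirrors the two iteration strategies described above, using small hypothetical in-memory mappings in place of the fieldCache/fieldValueCache (Doc->Terms) and the filterCache (Term->Docs).

```python
from collections import Counter

# Hypothetical toy index for a us_state-like field.
doc_terms = {1: ["CA"], 2: ["NY"], 3: ["CA"], 4: ["TX"]}   # Doc -> Terms ("uninverted")
term_docs = {"CA": {1, 3}, "NY": {2}, "TX": {4}}           # Term -> Docs (inverted)


def count_fc(result_docs):
    """fc-style: walk the result documents, bumping a counter per Term."""
    counts = Counter()
    for doc in result_docs:
        counts.update(doc_terms[doc])
    return dict(counts)


def count_enum(result_docs):
    """enum-style: walk every Term, intersecting its doc set with the results."""
    return {term: len(docs & result_docs) for term, docs in term_docs.items()}


results = {1, 2, 3}
print(count_fc(results))    # work grows with the number of matching docs
print(count_enum(results))  # work grows with the number of distinct Terms
```

In this model the cost of count_enum depends on how many Terms exist, which is why a field with a small, fixed vocabulary suits it.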
facet.enum.cache.minDf=...
- facet.method=enum normally results
in the filterCache being used for
every Term in the Facet field.
- For some Terms, the total number of documents containing that
Term may be so small that the time/effort spent computing the set
of documents is cheaper than the "slot" it will take up in the
filterCache (when it might evict something else)
- facet.enum.cache.minDf allows you
to pick a minimum document
frequency. The filterCache will not
be used for Terms that are in fewer documents than the value
you specify.
- The default value is 0. Increasing
it will result in increased execution time, but decreased
memory (ie: filterCache) usage.
facet.method=enum
and facet.enum.cache.minDf=100 (or
higher) is basically the only way to feasibly Facet on a
Full-Text field.
Cache Settings
- Make sure filterCache
and fieldValueCache are big
enough to handle your expected Facet requests w/o constant
eviction
- Use autowarmCount on
your filterCache. Consider using it
on your fieldValueCache.
<filterCache class="solr.FastLRUCache"
size="1024"
initialSize="1024"
autowarmCount="1024"/>
Sizing caches appropriately depends on how many different
fields you expect to Facet on, how many distinct Facet
Queries, etc. Step one towards picking good numbers is
looking at your caching stats before and after typical
queries. If a single request, on an isolated dev instance
that you just started up, causes any evictions, your cache is
too small. Either increase the size or get rid of the cache: at
that size it's just going to slow you down.
autowarmCount controls how many
values are pre-computed for each new cache instance whenever
the old one needs to be thrown out (ie: a new index is
loaded). The items to pre-compute are selected from the "top"
keys of the existing cache. Using autowarming
for fieldValueCache is questionable
because the objects are so big -- one stray request to Facet
on an atypical field could eat up a lot of memory until you
completely shut down Solr.
Static Warming
- Use firstSearcher to execute some
"seed" queries with expected Facets when Solr is initially run.
- Use newSearcher to execute some
queries that will populate your fieldCache
and fieldValueCache when Solr loads a new index.
<listener event="newSearcher"
class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">*:*</str>
<str name="facet">true</str>
<str name="facet.field">category</str>
...
Static warming queries should simulate real queries as much as
possible, but there is no need for a lot of overlap. The key
is to ensure that every field you expect to Facet on is hit,
so that the caches used are properly seeded.
newSearcher events automatically
trigger "auto warming" for caches that configure it,
but fieldCache doesn't currently
support auto warming, and as mentioned previously there are
performance reasons why auto
warming fieldValueCache may not be
wise. (Those same reasons would apply
to fieldCache even if it did support
auto warming.)
Coming Soon(ish)
term QParser
- All of the advantages of the raw QParser
- Will also work on encoded numeric fields
fq = {!term f=category}Books & Magazines
fq = {!term f=weight}1.56
SOLR-2113 tracks this functionality. It has already been
committed to the trunk, so it should certainly be included in
Solr 4.0, and it will likely be back ported and included in
Solr 3.1 as well.
facet.range=...
Same semantics as facet.date, but
also works on any numeric field type (that supports range queries)
facet.range = popularity
facet.range.start = 0.0
facet.range.end = 100.0
facet.range.gap = 33.333334
This functionality will be included in Solr 3.1.
FYI: Because of this more generalized functionality,
facet.date will likely be deprecated
in future releases.
facet.range.include=...
New multivalued param for controlling which end points of the
Constraint ranges are inclusive.
- lower - Gap Constraints include their lower bound
- upper - Gap Constraints include their upper bound
- edge - First/Last Gap Constraints include the 'edge' bounds
even if upper or lower is not specified
- outer - the 'before' and 'after'
Constraints will include their bounds, even if they are
already included by a Gap Constraint.
- all - shorthand for lower, upper, edge, and outer
This functionality will be included in Solr 3.1.
FYI: facet.date.include has also been
added.
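How the lower, upper, and edge options combine for the Gap Constraints can be sketched as follows. This is a simplified model based only on the descriptions above, not Solr's code: gap_ranges is a made-up name, the range is assumed to divide evenly into gaps, and the 'before'/'after' Constraints (which is where outer applies) are not modelled.

```python
def gap_ranges(start, end, gap, include):
    """Return (low, high, low_inclusive, high_inclusive) for each Gap Constraint.

    Assumes (end - start) is an exact multiple of gap.
    """
    include = set(include)
    if "all" in include:
        include |= {"lower", "upper", "edge", "outer"}
    ranges = []
    low = start
    while low < end:
        high = low + gap
        # "edge" only affects the first range's lower bound
        # and the last range's upper bound.
        low_inc = "lower" in include or ("edge" in include and low == start)
        high_inc = "upper" in include or ("edge" in include and high == end)
        ranges.append((low, high, low_inc, high_inc))
        low = high
    return ranges


for low, high, low_inc, high_inc in gap_ranges(0, 30, 10, ["lower", "edge"]):
    print("%s%d TO %d%s" % ("[" if low_inc else "{", low, high,
                            "]" if high_inc else "}"))
```

With lower and edge together, every value from start to end inclusive lands in exactly one Gap Constraint, which is the property the overlapping-boundaries discussion earlier was after.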
Pivot Facets
- Computes a Matrix of Constraint Counts across multiple Facet Fields
to generate a "Decision Tree"
- "If you constrain Facet X by A, then the Constraint Counts
for Facet Y will be ..."
<lst name="cat,inStock">
<lst name="electronics">
<int name="true">10</int>
<int name="false">4</int>
</lst>
<lst name="memory">
<int name="true">3</int>
<int name="false">0</int>
</lst>
...
This feature is being tracked in SOLR-792.
It has already been committed to
the trunk, so it should certainly be included in
Solr 4.0. It may be back ported and included in
Solr 3.1 as well, depending on demand and complexity in porting.
Native Taxonomy Facets
- Proposed FieldType and faceting code
that transparently deals with Taxonomy based fields (ie: no
special indexing code needed to generate all the Terms)
- Includes a facet.depth param for
controlling how many levels of the Taxonomy to display counts
for on any given request.
This feature is being tracked in
SOLR-64.
A patch is available but it has not yet been committed, so it
is not certain what release it may be in.
I find the output format currently used by the patch to be too
confusing to even show on this slide.