This document is designed to be viewed as a reveal.js slide show presentation. The contents of the presentation are reproduced below.

A ZIP file containing both of my ApacheCon 2014 NA presentations is available for offline perusal.

What's New In
Apache Solr?


ApacheCon 2014 NA - 2014-04-07

https://people.apache.org/~hossman/ac2014na

https://twitter.com/_hossman

http://www.lucidworks.com/


Acceleration!

Graph of Solr Release Dates

Adding Data &
schema.xml

(Quick Refresher)

Java (SolrJ)

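// Requires SolrJ on the classpath (org.apache.solr.common.SolrInputDocument).
// solr_server is assumed to be an already-configured SolrJ client, e.g. an HttpSolrServer.
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", 11852);
doc.addField("title", "Nightfall");
doc.addField("url", "http://www.isfdb.org/cgi-bin/title.cgi?11852");
doc.addField("rating", 8.7F);
doc.addField("authors", "Isaac Asimov");      // multiValued field: add one value per call
doc.addField("authors", "Robert Silverberg");
solr_server.add(doc);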

POST XML

<add>
 <doc>
  <field name="id">11852</field>
  <field name="title">Nightfall</field>
  <field name="url">http://www.isfdb.org/cgi-bin/title.cgi?11852</field>
  <field name="rating">8.7</field>
  <field name="authors">Isaac Asimov</field>
  <field name="authors">Robert Silverberg</field>
  ...

POST JSON

[
 {
  "id":11852,
  "title":"Nightfall",
  "url":"http://www.isfdb.org/cgi-bin/title.cgi?11852",
  "rating":8.7,
  "authors":[ "Isaac Asimov", 
              "Robert Silverberg" ]
  ...

schema.xml Fields

<field name="id"        type="tint"   indexed="true"  stored="true" /> 
<field name="title"     type="text"   indexed="true"  stored="true" />
<field name="url"       type="string" indexed="false" stored="true" />
<field name="summary"   type="text"   indexed="true"  stored="false"/>
<field name="rating"    type="tfloat" indexed="true"  stored="true" />
<field name="authors"   type="text"   multiValued="true" />
...

schema.xml Field Types

<fieldType name="string" class="solr.StrField" />
<fieldType name="tint"   class="solr.TrieIntField" precisionStep="0" />
<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="0" />
<fieldType name="text"   class="solr.TextField" indexed="true" stored="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    ...

Schema API

FieldTypes API

GET .../schema/fieldtypes
{
 "fieldTypes":[{
      "name":"string",
      "class":"solr.StrField",
      "fields":["url",...]},
    {
      "name":"tint",
      "class":"solr.TrieIntField",
      "precisionStep":"0",
      "fields":["id"]},
    {
      "name":"text",
      "class":"solr.TextField",
      "indexed":true,
      "stored":false,
      "indexAnalyzer":{
        "tokenizer":{
          "class":"solr.StandardTokenizerFactory"}
  ...

Fields API

GET .../schema/fields
{
  "fields":[{
      "name":"id",
      "type":"tint",
      "indexed":true,
      "stored":true},
    {
      "name":"url",
      "type":"string",
      "indexed":false,
      "stored":true},
    {
      "name":"authors",
      "multiValued":true,
      "type":"text"},
    ...

Managed
Schema

Mutable Schema

Enabled in solrconfig.xml
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

Schema-Less?

Add Fields Automatically

<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
  <str name="defaultFieldType">text</str>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">boolean</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.util.Date</str>
    <str name="fieldType">tdate</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Integer</str>
    <str name="fieldType">tint</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">tdouble</str>
  </lst>
</processor>
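
As a rough illustration of how the typeMapping entries above could pick a field type, the sketch below tries each valueClass in the order listed and falls back to defaultFieldType. The class and method names are hypothetical, and the first-match-wins ordering is an assumption about the processor's behavior rather than something stated on the slide.

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

class TypeMappingSketch {
    // Mirrors <str name="defaultFieldType">text</str>
    static final String DEFAULT_FIELD_TYPE = "text";

    // Mirrors the <lst name="typeMapping"> entries, kept in configuration order
    static final Map<Class<?>, String> TYPE_MAPPINGS = new LinkedHashMap<>();
    static {
        TYPE_MAPPINGS.put(Boolean.class, "boolean");
        TYPE_MAPPINGS.put(Date.class, "tdate");
        TYPE_MAPPINGS.put(Integer.class, "tint");
        TYPE_MAPPINGS.put(Number.class, "tdouble");
    }

    /** Returns the field type of the first mapping whose valueClass matches the value. */
    static String fieldTypeFor(Object value) {
        for (Map.Entry<Class<?>, String> mapping : TYPE_MAPPINGS.entrySet()) {
            if (mapping.getKey().isInstance(value)) {
                return mapping.getValue();
            }
        }
        return DEFAULT_FIELD_TYPE;
    }

    public static void main(String[] args) {
        System.out.println(fieldTypeFor(11852));        // an Integer is tested before the broader Number
        System.out.println(fieldTypeFor(8.7F));         // a Float only matches Number
        System.out.println(fieldTypeFor("Nightfall"));  // no mapping matches, so the default applies
    }
}
```

Under that assumption the order of the entries matters: every Integer is also a Number, so the java.lang.Integer mapping has to come before java.lang.Number to have any effect.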

Value Type Helpers

<processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
<processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
  <str name="defaultTimeZone">Europe/Paris</str>
  <str name="locale">fr_FR</str>
  <arr name="format">
    <str>'le' EEEE dd MMMM yyyy</str>
    <str>'le' dd MMM. yyyy 'à' HH 'h' mm</str>
    ...

Queries &
Pagination

(Quick Refresher)

Search, Filter, Sort, Paginate

    q = title:Nightfall      # Affects score
   fq = rating:[5.0 TO *]    # Constrains result set, non-scoring
 sort = score desc           # Order of result list
start = 0                    # Offset in result list
 rows = 20                   # Size of result list slice
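
A minimal sketch of turning the five parameters above into a URL query string, using only the JDK. The class and method names are hypothetical; the only logic beyond URL encoding is deriving start from a 1-based page number.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class SearchParamsSketch {
    /** Builds the query string for one page of results; pages are numbered from 1. */
    static String queryString(String q, String fq, String sort, int page, int rows) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", q);
        params.put("fq", fq);
        params.put("sort", sort);
        params.put("start", String.valueOf((page - 1) * rows)); // offset of the first row on this page
        params.put("rows", String.valueOf(rows));

        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> param : params.entrySet()) {
            joined.add(param.getKey() + "=" + URLEncoder.encode(param.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Second page of 20 results, as in the "Page #2" request
        System.out.println(queryString("title:Nightfall", "rating:[5.0 TO *]", "score desc", 2, 20));
    }
}
```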

Page #1

...&sort=score+desc&rows=20&start=0
{
  "response":{"numFound":32145678,"start":0,"docs":[
    {
      "id":11852,
      "title":"Nightfall",
      "url":"http://www.isfdb.org/cgi-bin/title.cgi?11852",
      "rating":8.7,
      "authors":[ "Isaac Asimov", 
                  "Robert Silverberg" ]},
    ...

Page #2

...&sort=score+desc&rows=20&start=20
{
  "response":{"numFound":32145678,"start":20,"docs":[
    {
      "id":15475,
      "title":"The Legend of Nightfall",
      "url":"http://www.isfdb.org/cgi-bin/title.cgi?15475",
      "rating":4.3,
      "authors":[ "Mickey Zucker Reichert" ]},
    ...

Pagination
Performance?

Chart showing relative performance of Classic Pagination vs Cursor Pagination for deep paging

RED IS BAD

Cursors To The
Rescue!

Start Cursor

...&sort=score+desc,id+desc&cursorMark=*
{
  "response":{"numFound":32145678,"start":0,"docs":[
    {
      "id":11852,
      "title":"Nightfall",
      "url":"http://www.isfdb.org/cgi-bin/title.cgi?11852",
      "rating":8.7,
      "authors":[ "Isaac Asimov", 
                  "Robert Silverberg" ]},
    ...
    ]},
  "nextCursorMark":"AoEjR0JQ"}

Next Cursor

...&sort=score+desc,id+desc&cursorMark=AoEjR0JQ
{
  "response":{"numFound":32145678,"start":0,"docs":[
    {
      "id":15475,
      "title":"The Legend of Nightfall",
      "url":"http://www.isfdb.org/cgi-bin/title.cgi?15475",
      "rating":4.3,
      "authors":[ "Mickey Zucker Reichert" ]},
    ...
    ]},
  "nextCursorMark":"AoEpVkRCREIxQTE2"}

Chart showing relative performance of Classic Pagination vs Cursor Pagination for deep paging

GREEN IS GOOD
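
The client side of cursor paging can be sketched as a loop: send cursorMark=* first, pass each nextCursorMark back on the following request, and stop once the returned mark equals the one that was sent. Everything below is a hypothetical stand-in, not SolrJ: fakeSolr is an in-memory substitute for the server that sorts by id ascending and uses the last id returned as its mark, whereas real cursor marks are opaque strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class CursorPagingSketch {
    /** One page of results plus the mark to send with the next request. */
    static final class Page {
        final List<Integer> docs;
        final String nextCursorMark;

        Page(List<Integer> docs, String nextCursorMark) {
            this.docs = docs;
            this.nextCursorMark = nextCursorMark;
        }
    }

    /** Walks every page: start at "*", stop when the mark stops changing. */
    static List<Integer> fetchAll(Function<String, Page> fetchPage) {
        List<Integer> all = new ArrayList<>();
        String cursorMark = "*";
        while (true) {
            Page page = fetchPage.apply(cursorMark);
            all.addAll(page.docs);
            if (page.nextCursorMark.equals(cursorMark)) {
                break; // same mark came back: nothing left to fetch
            }
            cursorMark = page.nextCursorMark;
        }
        return all;
    }

    /** Hypothetical in-memory server: ids sorted ascending, mark = last id returned. */
    static Function<String, Page> fakeSolr(List<Integer> sortedIds, int rows) {
        return mark -> {
            boolean first = mark.equals("*");
            int last = first ? 0 : Integer.parseInt(mark);
            List<Integer> docs = new ArrayList<>();
            for (int id : sortedIds) {
                // Resume strictly after the last id seen instead of counting an offset
                if ((first || id > last) && docs.size() < rows) {
                    docs.add(id);
                }
            }
            String next = docs.isEmpty() ? mark : String.valueOf(docs.get(docs.size() - 1));
            return new Page(docs, next);
        };
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(fakeSolr(Arrays.asList(3, 7, 11, 15, 20), 2)));
    }
}
```

Resuming from the last sort value seen, rather than from a numeric offset, is the idea the sketch is meant to convey; the real sort in the slides also includes id as a tie-breaker so that every document has a unique position.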

Sorting, Faceting,
& DocValues

Inverted Indexes

☑ Fast Term Search

Input

  • D1: Nightfall
  • D2: A Time to Rend
  • D3: The Legend of Nightfall
  • D4: Legends from the End of Time
  • D5: Time of Legends
  • D6: Legends
  • D7: About Time

Index

  • a → D2
  • about → D7
  • end → D4
  • from → D4
  • legend → D3, D4, D5, D6
  • nightfall → D1, D3
  • of → D3, D4, D5
  • rend → D2
  • the → D3, D4
  • time → D2, D4, D5, D7
  • to → D2
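
A toy version of this term index can be built in a few lines of Java. The analyze method is a hypothetical stand-in for a real analysis chain: it lowercases, splits on whitespace, and hard-codes the single stemming rule needed for these titles ("legends" becomes "legend").

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    /** Toy analysis: lowercase, split on whitespace, reduce "legends" to "legend". */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            terms.add(token.equals("legends") ? "legend" : token);
        }
        return terms;
    }

    /** Maps each term to the ids of the documents containing it, terms in sorted order. */
    static Map<String, List<String>> build(Map<String, String> docs) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String term : analyze(doc.getValue())) {
                List<String> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                if (!postings.contains(doc.getKey())) {
                    postings.add(doc.getKey()); // record each document once per term
                }
            }
        }
        return index;
    }

    /** The seven titles from the Input list. */
    static Map<String, String> sampleTitles() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("D1", "Nightfall");
        docs.put("D2", "A Time to Rend");
        docs.put("D3", "The Legend of Nightfall");
        docs.put("D4", "Legends from the End of Time");
        docs.put("D5", "Time of Legends");
        docs.put("D6", "Legends");
        docs.put("D7", "About Time");
        return docs;
    }

    public static void main(String[] args) {
        // A term search is a single map lookup, however many documents exist
        System.out.println(build(sampleTitles()).get("nightfall"));
    }
}
```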

Inverted Indexes

☑ Fast Range Search

Input

  • D1: 8.7
  • D2: 3.2
  • D3: 4.3
  • D4: 4.3
  • D5: 1.8
  • D6: 5.7
  • D7: 4.5

Index

  • 1.8 → D5
  • 3.2 → D2
  • 4.3 → D3, D4
  • 4.5 → D7
  • 5.7 → D6
  • 8.7 → D1
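
Because the terms are kept in sorted order, a range query only has to walk one contiguous slice of the index. The sketch below imitates that with a TreeMap over the ratings above; the class and method names are hypothetical, and a real Trie field stores its terms differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class RangeSearchSketch {
    /** The rating index from the slide: sorted value -> documents holding that value. */
    static NavigableMap<Double, List<String>> sampleIndex() {
        NavigableMap<Double, List<String>> index = new TreeMap<>();
        index.put(1.8, Arrays.asList("D5"));
        index.put(3.2, Arrays.asList("D2"));
        index.put(4.3, Arrays.asList("D3", "D4"));
        index.put(4.5, Arrays.asList("D7"));
        index.put(5.7, Arrays.asList("D6"));
        index.put(8.7, Arrays.asList("D1"));
        return index;
    }

    /** Equivalent in spirit to rating:[min TO *]: every document at or above min. */
    static List<String> atLeast(NavigableMap<Double, List<String>> index, double min) {
        List<String> hits = new ArrayList<>();
        // tailMap is a view of the sorted keys >= min; no other entries are visited
        for (List<String> postings : index.tailMap(min, true).values()) {
            hits.addAll(postings);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(atLeast(sampleIndex(), 5.0));
    }
}
```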

Inverted Indexes

☐ Fast Sorting

☐ Fast Faceting

Input

  • D1: 8.7
  • D2: 3.2
  • D3: 4.3
  • D4: 4.3
  • D5: 1.8
  • D6: 5.7
  • D7: 4.5

Index

  • 1.8 → D5
  • 3.2 → D2
  • 4.3 → D3, D4
  • 4.5 → D7
  • 5.7 → D6
  • 8.7 → D1

FieldCache

  • D1 → 8.7
  • D2 → 3.2
  • D3 → 4.3
  • D4 → 4.3
  • D5 → 1.8
  • D6 → 5.7
  • D7 → 4.5

DocValues

☑ Fast Sorting

☑ Fast Faceting

Input

  • D1: 8.7
  • D2: 3.2
  • D3: 4.3
  • D4: 4.3
  • D5: 1.8
  • D6: 5.7
  • D7: 4.5

DocValues

  • D1 → 8.7
  • D2 → 3.2
  • D3 → 4.3
  • D4 → 4.3
  • D5 → 1.8
  • D6 → 5.7
  • D7 → 4.5
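
Sorting and faceting both ask "what is the value for this document?", which is the lookup a document-to-value column answers directly. A minimal sketch, assuming the column is a plain array indexed by document number; the names are hypothetical and the real on-disk DocValues format is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DocValuesSketch {
    // Column of ratings: position 0 holds D1, position 6 holds D7
    static final double[] RATING = {8.7, 3.2, 4.3, 4.3, 1.8, 5.7, 4.5};

    /** Orders matching document numbers (1-based) by rating, highest first. */
    static List<Integer> sortByRatingDesc(List<Integer> matchingDocs) {
        List<Integer> sorted = new ArrayList<>(matchingDocs);
        sorted.sort(Comparator.comparingDouble((Integer doc) -> RATING[doc - 1]).reversed());
        return sorted;
    }

    /** Counts how many matching documents carry each rating value. */
    static Map<Double, Integer> facetCounts(List<Integer> matchingDocs) {
        Map<Double, Integer> counts = new TreeMap<>();
        for (int doc : matchingDocs) {
            counts.merge(RATING[doc - 1], 1, Integer::sum); // one array read per document
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> allDocs = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        System.out.println(sortByRatingDesc(allDocs));
        System.out.println(facetCounts(allDocs));
    }
}
```

With only the inverted index, the same two operations would have to scan the term lists to find out which value belongs to each matching document.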

DocValues in schema.xml

<field name="rating"     type="tfloat" indexed="true"  docValues="true"  />
<field name="title_sort" type="string" indexed="false" docValues="true"  />
<field name="title"      type="text"   indexed="true"  docValues="false" />

Solr Cloud

(Just Scratching the Surface)

New In Solr Cloud

  • Custom Sharding & Routing via router.name:
    • compositeId: Hash based, optionally driven by id prefix
    • implicit: Infer shard from where client sent document
      • ...or from the _route_ param
  • Shard Splitting w/o Downtime:
    • Divide shards in two on the fly as your index grows
    • Optionally split by ranges or route key via split.key
  • Hadoop Integration:
    • Keeping indexes in HDFS
    • Building Indexes with MapReduce

Documentation!

(No, Seriously: Good Documentation)

Solr Reference Guide

https://lucene.apache.org/solr/documentation.html

  • Online "Live" documentation maintained in Confluence
    • Supports public comments for questions & feedback
  • Formally released PDFs for each major feature release of Solr
    • Available from the Apache mirror network

Anything Else?

Lots More, Not Enough Time

Q & A