
This document is designed to be viewed as a reveal.js slide show presentation. The contents of the presentation (and the speaker notes) follow below.

IR, Solr & Data Science: An Introduction in 3 Acts

Chris "Hoss" Hostetter - 2017-09-28


https://home.apache.org/~hossman/meetup20170928/

https://twitter.com/_hossman

https://www.lucidworks.com/


Who Am I?

  • Software Developer
  • NO Formal/Informal Science / Academic background
  • ~20 Years working on "Search" software
  • ~13 Years working on Lucene/Solr
  • Employed by Lucidworks to "Make Solr Better"

Agenda

  1. What's Information Retrieval?
  2. What's Solr & Why do Software Devs use it?
  3. Why should Data Scientists be interested in Solr?

Act I
Intro to IR

The True Story of King Index*

* Not A True Story

The First Inverted Index*

*Not Really

  • Sir Ruprect - p6, p9, p32, ...
  • King Richard - p7, p9, p16, ...
  • Queen Amidala - p8, p9, p32, ...
  • Prince Cletus - p15, p17, p23
  • ...
  • King Cletus IV - p427, p428, p457
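
The list above maps each name to the pages that mention it, which is the core idea of an inverted index. A minimal Python sketch of building one, assuming a hypothetical `pages` dict (page number to text) and a naive whitespace tokenizer; neither comes from the talk:

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercased term to the sorted list of page numbers containing it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_number)
    return {term: sorted(numbers) for term, numbers in index.items()}


# Hypothetical page contents, invented purely for illustration.
pages = {
    6: "ruprect arrives at court",
    7: "richard is crowned",
    9: "ruprect and richard meet amidala",
}
index = build_inverted_index(pages)
print(index["ruprect"])  # pages that mention "ruprect"
```

Looking up a term is then a single dictionary access instead of a scan over every page.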

Information Retrieval

According to the Gospel of Hoss*


  1. Spend "Work" (Time | CPU | Disk | RAM) in advance (Indexing) to save "Work" when retrieving (Querying)
  2. Spend "Work" to "Score" information retrieved against the Query that retrieved it -- relative to all other known information

*Not an Actual Gospel

Scoring Function (BM25)
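
The formula shown on this slide is not part of the text. As a rough illustration only, here is a Python sketch of the commonly cited BM25 form, assuming the Lucene-style IDF `ln(1 + (N - n + 0.5) / (n + 0.5))` and the usual defaults `k1=1.2`, `b=0.75`. The function name and the sample documents are hypothetical, and the sketch assumes at least one non-empty document:

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms` with BM25."""
    doc_count = len(docs)
    avg_len = sum(len(d) for d in docs) / doc_count
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            n = doc_freq.get(term, 0)
            if n == 0:
                continue  # term appears nowhere in the collection
            idf = math.log(1 + (doc_count - n + 0.5) / (n + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents need more matches to score well.
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical tokenized documents, invented for illustration.
docs = [
    ["vegas", "expense", "report"],
    ["vegas", "vegas", "notes"],
    ["tax", "summary"],
]
scores = bm25_scores(["vegas"], docs)
print(scores)
```

Each score depends on collection-wide statistics (document count, document frequency, average length), which is what "relative to all other known information" means in practice.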

Act II
Solr For Devs/Biz

What Is Solr?

Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.

Less Marketing, More Tech?

  • Search centric data-store, implemented in Java
  • HTTP based API for indexing & querying structured data
  • Lots of configuration & customization options and plugin support
  • Scalable Horizontally (more data) & Vertically (more reliability) across multiple machines
  • Open Source, governed by the Apache Software Foundation

Scalability & Reliability

The error rate is down by two orders of magnitude with 99% of search results served in under 500ms. The number of machines needed to run search dropped from ~200 earlier this year down to ~30 so we even managed to get some cost savings.

Features & Configurability

"name":"text_ca",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer":{
  "tokenizer":{
    "class":"solr.StandardTokenizerFactory"},
  "filters":[{
      "class":"solr.ElisionFilterFactory",
      "articles":"lang/contractions_ca.txt",
      "ignoreCase":"true"},
    {
      "class":"solr.LowerCaseFilterFactory"},
    {
      "class":"solr.StopFilterFactory",
      "words":"lang/stopwords_ca.txt",
      "ignoreCase":"true"},
    {
      "class":"solr.SnowballPorterFilterFactory",
      "language":"Catalan"}]}}

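The field type above chains a tokenizer with a series of token filters. A toy Python sketch of that idea, assuming tiny hypothetical stand-ins for `lang/contractions_ca.txt` and `lang/stopwords_ca.txt` and a regex in place of the real tokenizer; the stemming step is omitted, and none of this is Solr's actual implementation:

```python
import re

# Hypothetical stand-ins for lang/contractions_ca.txt and lang/stopwords_ca.txt.
ELIDED_ARTICLES = {"l", "d", "m", "n", "s", "t"}
STOPWORDS = {"de", "la", "el", "i", "a"}


def tokenize(text):
    """Return runs of word characters and apostrophes."""
    return re.findall(r"[\w']+", text)


def strip_elision(token):
    """Drop a leading article such as "l'" when it appears in ELIDED_ARTICLES."""
    head, sep, tail = token.partition("'")
    if sep and tail and head.lower() in ELIDED_ARTICLES:
        return tail
    return token


def analyze(text):
    """Apply the filters in the same order as the config: elision, lowercase, stop."""
    tokens = [strip_elision(t) for t in tokenize(text)]
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


print(analyze("L'Hospitalet de Llobregat"))
```

The order matters: the stop filter here only matches because lowercasing has already run.
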
Act III
Solr For Data-Sci?

Wanna Search Some Stuff?

$ bin/solr -e schemaless
$ bin/post -c gettingstarted ~/
$ curl 'http://localhost:8983/solr/gettingstarted/select?q=vegas&fl=id,content_type'
...
"response":{"numFound":14,"start":0,"docs":[
{
  "id":"/home/hossman/taxes/2016/hoss-expense-reports/oct-bos-vegas/notes.txt",
  "content_type":["text/plain; charset=ISO-8859-1"]},
{
  "id":"/home/hossman/taxes/2017/hoss-expense-reports/sep-vegas/expense_report_with_all_receipts.pdf",
  "content_type":["application/pdf"]},
{
  "id":"/home/hossman/taxes/2016/hoss-business-expense-summary.txt",
  "content_type":["text/plain; charset=ISO-8859-1"]},
...

Structured Data? Even Better...

$ bin/solr -e schemaless
$ bin/post -c gettingstarted -Dparams='trim=true' ~/crime-data/*.csv
$ curl '.../select?q=chrgdesc:"SUSPENDED+DRIVER+LICENSE"&fl=case_id,neighborhd,age,sex'
...
"response":{"numFound":1848,"start":0,"docs":[
    {
      "case_id":[1507010082],
      "age":[30],
      "sex":["F"],
      "neighborhd":["T303"]},
    {
      "case_id":[1507160112],
      "age":[30],
      "sex":["M"],
      "neighborhd":["T306"]},
    {
      "case_id":[1508260084],
      "age":[34],
      "sex":["M"],
      "neighborhd":["T205"]},
...
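
The same kind of request can be built programmatically. A small Python sketch that only constructs the URL (it sends nothing), assuming the collection is the `gettingstarted` one from the earlier slide; `build_select_url` is a hypothetical helper, and it does not escape quotes inside the phrase:

```python
from urllib.parse import urlencode


def build_select_url(base_url, field, phrase, fields):
    """Build a /select URL for a phrase query on one field, URL-encoding all parameters."""
    params = {
        "q": f'{field}:"{phrase}"',
        "fl": ",".join(fields),
    }
    return f"{base_url}/select?{urlencode(params)}"


# Assumed base URL, taken from the earlier "gettingstarted" example.
url = build_select_url(
    "http://localhost:8983/solr/gettingstarted",
    "chrgdesc",
    "SUSPENDED DRIVER LICENSE",
    ["case_id", "neighborhd", "age", "sex"],
)
print(url)
```

Letting `urlencode` handle the quotes, colon and spaces avoids hand-writing `+` and `%22` in the query string.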

Data Exploration

"tokenizer":{
  "class":"solr.StandardTokenizerFactory"},
"filters":[{
    "class":"solr.LowerCaseFilterFactory"},
  {
    "class":"solr.StopFilterFactory",
    "format":"snowball",
    "words":"lang/stopwords_de.txt",
    "ignoreCase":"true"},
  {
    "class":"solr.GermanNormalizationFilterFactory"},
  {
    "class":"solr.GermanLightStemFilterFactory"},
  {
    "class":"solr.ShingleFilterFactory",
    "minShingleSize":"2",
    "maxShingleSize":"2",
    "outputUnigrams":"false"}]}}}
$ curl '.../terms?terms.limit=500&terms.fl=_text_&'
{ "terms":{
    "_text_":[
      "_ ganz",1562,
      "_ mehr",1561,
      "_ schon",1559,
      "_ gross",1557,
      "zeit _",1553,
      "_ gut",1552,
...
      "produced by",1515,
      "_ aug",1514,
      "_ leb",1514,
      "seh _",1513,
...
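
The `_` in terms such as "_ ganz" and "zeit _" is consistent with the shingle filter's filler token, which marks a position where the stop filter removed a word. A simplified Python sketch of two-word shingles without unigrams, assuming a hypothetical stopword set; lowercasing, normalization and stemming are left out, and skipping filler-only pairs is a choice of this sketch, not a statement about Solr:

```python
FILLER = "_"


def shingle_bigrams(tokens, stopwords):
    """Return two-word shingles, writing FILLER where a stopword was removed."""
    # Replace stopwords with a filler so the gap they leave stays visible.
    slots = [FILLER if t in stopwords else t for t in tokens]
    shingles = []
    for left, right in zip(slots, slots[1:]):
        # Assumption of this sketch: skip pairs made only of fillers.
        if left == FILLER and right == FILLER:
            continue
        shingles.append(f"{left} {right}")
    return shingles


# Hypothetical input and stopword set, invented for illustration.
print(shingle_bigrams(["die", "zeit", "ist", "ganz", "gut"], {"die", "ist"}))
```

Counting these shingles across a corpus gives the kind of frequency list the terms request returns.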

Data Analytics

$ curl http://localhost:8983/solr/gettingstarted/select \
-d 'rows=0&q=chrgdesc:"SUSPENDED+DRIVER+LICENSE"&
 json.facet={
   where:{
     type : terms,
     field : neighborhd,
     limit : 100,
     facet:{
       stddev : "stddev(field(age,min))",
       mean_age : "avg(field(age,min))"
     }
   }
 }
'
"facets":{
  "count":1848,
  "where":{
    "buckets":[
      { "val":"t103",
        "count":76,
        "stddev":11.214820101289954,
        "mean_age":35.6986301369863},
      { "val":"t403",
        "count":70,
        "stddev":11.720066692176045,
        "mean_age":33.93939393939394},
...
      { "val":"t502",
        "count":45,
        "stddev":9.28920105274652,
        "mean_age":28.727272727272727},
...
      { "val":"t503",
        "count":11,
        "stddev":13.14446117207289,
        "mean_age":43.36363636363637},
...
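
Conceptually, the facet request groups matching documents by `neighborhd` and computes statistics over `age` within each bucket. A rough in-memory Python equivalent, assuming hypothetical sample records and the population standard deviation (the talk does not say which variant Solr computes); documents without an age still count toward the bucket but are skipped in the statistics, which is also an assumption of this sketch:

```python
import statistics
from collections import defaultdict


def facet_stats(records, bucket_field, value_field):
    """Group records by bucket_field; report count, mean and stddev of value_field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[bucket_field]].append(record.get(value_field))
    buckets = []
    for val, values in groups.items():
        present = [v for v in values if v is not None]
        buckets.append({
            "val": val,
            "count": len(values),
            "mean": statistics.fmean(present) if present else None,
            "stddev": statistics.pstdev(present) if present else None,
        })
    # Largest buckets first, mirroring the ordering in the response above.
    return sorted(buckets, key=lambda bucket: bucket["count"], reverse=True)


# Hypothetical records, invented for illustration.
records = [
    {"neighborhd": "t103", "age": 30},
    {"neighborhd": "t103", "age": 34},
    {"neighborhd": "t403", "age": 41},
]
buckets = facet_stats(records, "neighborhd", "age")
print(buckets)
```

The difference with Solr is where the work happens: the facet request computes these aggregates inside the index, so only the bucket summaries travel over the wire.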

Etc...

  • Streaming Expressions
    • Timeseries Analytics
    • Randomized Sampling
  • Graph Traversal
  • Learning To Rank
  • Document Clustering

Q & A