{"id":2753,"date":"2010-07-30T20:34:43","date_gmt":"2010-07-31T04:34:43","guid":{"rendered":"https:\/\/www.ultrasaurus.com\/?p=2753"},"modified":"2010-07-30T20:34:43","modified_gmt":"2010-07-31T04:34:43","slug":"lucenesolr-meetup-july-28","status":"publish","type":"post","link":"https:\/\/www.ultrasaurus.com\/2010\/07\/lucenesolr-meetup-july-28\/","title":{"rendered":"lucene\/solr meetup, july 28"},"content":{"rendered":"<p>I attended the <a href=\"http:\/\/www.meetup.com\/SFBay-Lucene-Solr-Meetup\/calendar\/14124466\/\">Lucene\/Solr meetup<\/a> this week &#8212; quite a swank event sponsored by <a href=\"http:\/\/www.salesforce.com\/\">Salesforce<\/a> with tasty appetizers, beers and an incredible view of the bay.  The three speakers were very knowledgeable and well spoken and I enjoyed hearing about the different applications of Lucene and Solr.  Below are my rough notes.\u00a0 For folks who want to learn more about Lucene and Solr, check out the upcoming conference <a href=\"http:\/\/lucenerevolution.org\/\">Lucene Revolution<\/a>, Oct  5-8, 2010 in Boston.<\/p>\n<h2>Search@salesforce.com, Bill Press, Salesforce<\/h2>\n<p>Salesforce uses Lucene 2.2 (not Solr) and shared some stats about their seriously large scale operation:<\/p>\n<ul>\n<li>millions of searches per day, hundreds of thousands of users<\/li>\n<li>hundreds of millions of doc updates per day<\/li>\n<li>force.com platform, 72,500+ customers, 150,000+ apps<\/li>\n<li>almost half a billion indexing events per day (batches can include &gt; 1000 documents)<\/li>\n<li>Over 8 TB of searchable data<\/li>\n<li>incremental indexing (90% &lt; 3 mins, 70% &lt; 1 min)<\/li>\n<li>6M queries per day, mean search time 250ms (76% &lt; 250 ms, 89% &lt; 500 ms)<\/li>\n<\/ul>\n<p>It&#8217;s a multi-tenant architecture: each org has anywhere from 1 to 100,000s of users, and there is a single codebase, which means there is just 1 version to support at one time.<\/p>\n<ul>\n<li>consistent hashing for node affinity<\/li>\n<li>throttling for 
fairness<\/li>\n<li>record type bucketing, as well as by org<\/li>\n<\/ul>\n<p>They use post-filtering for:<\/p>\n<ul>\n<li> authorization<\/li>\n<li> reranking in the DB, last update<\/li>\n<\/ul>\n<p>They query the DB to bridge the gap with indexing lag.<\/p>\n<p>They are faced with new search challenges driven by what the Salesforce CEO calls &#8220;the Facebook imperative.&#8221; When he started Salesforce, he used to ask &#8220;why doesn&#8217;t every enterprise app look like Amazon?&#8221; Now he asks: &#8220;why doesn&#8217;t every enterprise app look like Facebook?&#8221;\u00a0 (side note: this is an echo of what many folks have been saying for a while, that social networking makes sense as a feature of an app, rather than just destinations like Facebook and LinkedIn.)<\/p>\n<p>Salesforce allows you to have a feed on a record, follow accounts, and see status updates for accounts.\u00a0 They index tracked changes.\u00a0 They need to search this rich set of data, which is people articulating their interests. 
Bill noted that the needs of structured data are really different from those of unstructured data.<\/p>\n<h2>Practical Relevance, Grant Ingersoll, Lucid Imagination<\/h2>\n<p>Grant Ingersoll spoke of &#8220;two tales of relevance&#8221;<\/p>\n<ul>\n<li>The case of the missing data: you know you have poor relevance when the most important search result is on page 10.\u00a0 For example, the accessories for an item are often listed higher on search results than the item itself.<\/li>\n<li>The power of suggestion.\u00a0 Grant cited a specific case where just adding auto-suggest added 100s of millions to the bottom line.<\/li>\n<\/ul>\n<p>Better search results = less time searching, more time acting<\/p>\n<p>Other cases to consider:<\/p>\n<ul>\n<li>Only the first result matters, such as Google&#8217;s &#8220;I&#8217;m feeling lucky&#8221;<\/li>\n<li>Known item searches, for example: Netflix has a high frequency of people searching for specific movies<\/li>\n<li>Are you finding all the documents that are relevant?\u00a0 This matters when you want to analyze all the results returned.<\/li>\n<li>Is zero results the right answer?\u00a0 This matters when people want to know definitively that something is not present<\/li>\n<li>Is it important that you don&#8217;t have a result that doesn&#8217;t match (e.g. 
Yelp doesn&#8217;t want to find a plumber talking about unclogging what you just ate when you are looking for a restaurant)<\/li>\n<\/ul>\n<p>Before undertaking any relevance tuning, you need to define what &#8220;better search&#8221; means to you.\u00a0 There are many ways to test and measure:<\/p>\n<ul>\n<li>a\/b testing<\/li>\n<li>log analysis<\/li>\n<li>empirical (top x queries, plus random sample) &#8212; read and evaluate queries, top 10, top 50, have your top biz people rate what is important &#8211;&gt; leads to actionable data<\/li>\n<li>ask your users, thumbs up\/down around your search results<\/li>\n<li>Ad Hoc evaluation<\/li>\n<li>TREC &#8212; fixed data set, fixed queries, see the Open Relevance Project (open source TREC)<\/li>\n<\/ul>\n<p>Capturing user feedback:<\/p>\n<ul>\n<li> log analysis (click analysis)<\/li>\n<li> ratings\/reviews<\/li>\n<li> filters and facets<\/li>\n<\/ul>\n<p>Grant notes that Lucene searches default to &#8220;or&#8221; out of the box, when &#8220;and&#8221; is typically better today.\u00a0 He had a list of links that he suggested we check out (sadly I couldn&#8217;t type fast enough, but here are some I wrote down):<\/p>\n<ul>\n<li><a href=\"http:\/\/code.google.com\/p\/luke\/\">code.google.com\/p\/luke<\/a><\/li>\n<li>Solr analysis tool<\/li>\n<li><a href=\"http:\/\/sigir.org\/\">sigir.org<\/a><\/li>\n<li><a href=\"http:\/\/www.lucidimagination.com\/Community\/Hear-from-the-Experts\/Articles\/Debugging-Relevance-Issues-Search?utm_medium=lucene.li-copypaste&amp;utm_source=direct-lucene.li&amp;utm_content=awesm-site\">Debugging Relevance Issues in Search<\/a><\/li>\n<\/ul>\n<p>auto-add phrases to your queries &#8212; surround with quotes &#8212; automatic win<br \/>\nauto-add a &#8220;sloppy phrase&#8221;  &#8212; large slop factor, like an AND, boost when words are close<\/p>\n<h2>Logs,  Search, Cloud, Jon Gifford, Loggly<\/h2>\n<p>Logfile management in the cloud (no Hadoop).\u00a0 Logs are painful &#8212; distributed, large, 
ephemeral.\u00a0 Most log search is highly skewed.\u00a0 &#8220;We&#8217;re just implementing grep across terabytes of data.&#8221;\u00a0 This was a compelling talk, but it took most of my attention to follow, so my notes are weak and may make sense to no one except me:<\/p>\n<p>syslog + 0MQ + SolrCloud<br \/>\n0MQ &#8211; not traditional queuing; it can fail, and when it fails we lose data, but it is very fast<br \/>\nSolr gives us facets, which give us graphs<\/p>\n<p>run many indexers, &#8220;hot shards&#8221; &#8212; the indexers update small shards<\/p>\n<p>0MQ gives us node-specific input queues for Solr<\/p>\n<p>nrt + solrCloud = Our Nirvana<\/p>\n<p>Hot shards are chilled when we stop writing to them<\/p>\n<p>Solr is awesome at what it does, but not so good for data mining<br \/>\n&#8212; plan to plug in Hadoop for large-volume analytics<br \/>\nSyslog is the only way in for now; adding others: http, scribe, flume<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I attended the Lucene\/Solr meetup this week &#8212; quite a swank event sponsored by Salesforce with tasty appetizers, beers and an incredible view of the bay. The three speakers were very knowledgeable and well spoken and I enjoyed hearing about the different applications of Lucene and Solr. 
Below are my rough notes.\u00a0 For folks who&hellip; <a href=\"https:\/\/www.ultrasaurus.com\/2010\/07\/lucenesolr-meetup-july-28\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":84,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/www.ultrasaurus.com\/wp-json\/wp\/v2\/posts\/2753"}],"collection":[{"href":"https:\/\/www.ultrasaurus.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ultrasaurus.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ultrasaurus.com\/wp-json\/wp\/v2\/users\/84"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ultrasaurus.com\/wp-json\/wp\/v2\/comments?post=2753"}],"version-history":[{"count":0,"href":"https:\/\/www.ultrasaurus.com\/wp-json\/wp\/v2\/posts\/2753\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.ultrasaurus.com\/wp-json\/wp\/v2\/media?parent=2753"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ultrasaurus.com\/wp-json\/wp\/v2\/categories?post=2753"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ultrasaurus.com\/wp-json\/wp\/v2\/tags?post=2753"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}