Solr, Zookeeper and Lucidworks Fusion as Windows Services

NSSM – Solr – Zookeeper – Fusion

Fusion Production Farm on Windows Servers

All parts run as Windows Services to avoid stoppages when users log out

Written by Anria Billavara – Innovent Solutions Inc. 8 Dec 2015

Requirements and Constraints

This setup covers a farm of three virtual servers in a heavily Microsoft-dependent environment.

Every piece of software must run as a self-sufficient Windows Service that starts automatically when the server restarts.

Software

Use                        Software
Operating System           Windows Server R2
Solr                       5.2.1
Zookeeper                  3.5.1
Lucidworks Fusion          2.1.1
Java                       jdk-7u79-windows-x64
Windows Services           Download here
Important                  NSSM does NOT work with JDK 8
Windows Services Fusion    Download here

Install Windows Services script

Next to the nssm.exe executable, create this installWindowsServices.bat file.

NOTE: Change the paths to match your own installation directories.

@echo off


echo ------------------------------------------
echo - Install Windows Service : ApacheZookeeper
echo ------------------------------------------

rem "nssm install" takes the service name followed directly by the program path
nssm.exe install ApacheZookeeper D:\apache\zookeeper351\bin\run.bat
nssm.exe set ApacheZookeeper Application D:\apache\zookeeper351\bin\run.bat
nssm.exe set ApacheZookeeper AppDirectory D:\apache\zookeeper351\bin
nssm.exe set ApacheZookeeper DisplayName Apache Zookeeper
nssm.exe set ApacheZookeeper Description Apache Zookeeper 3.5.1
nssm.exe set ApacheZookeeper Start SERVICE_AUTO_START
nssm.exe set ApacheZookeeper AppRestartDelay 0


echo ------------------------------------------
echo - Install Windows Service : ApacheSolr
echo ------------------------------------------

rem "nssm install" takes the service name followed directly by the program path
nssm.exe install ApacheSolr D:\apache\solr521\bin\run.bat
nssm.exe set ApacheSolr Application D:\apache\solr521\bin\run.bat
nssm.exe set ApacheSolr AppDirectory D:\apache\solr521\bin
nssm.exe set ApacheSolr DisplayName Apache Solr
nssm.exe set ApacheSolr Description Apache Solr 5.2.1
nssm.exe set ApacheSolr Start SERVICE_AUTO_START
nssm.exe set ApacheSolr DependOnService ApacheZookeeper
nssm.exe set ApacheSolr AppRestartDelay 15


echo ------------------------------------------
echo - Install Windows Service : LWFusionService
echo ------------------------------------------

rem Fusion is installed with prunsrv rather than NSSM; ^ continues the command on the next line
prunsrv //IS//LWFusionService ^
  --DisplayName="Lucidworks Fusion" ^
  --Description="Lucidworks Fusion Control Script" ^
  --StartMode=exe ^
  --StartPath="d:\Lucidworks\fusion211\bin" ^
  --StartImage="d:\Lucidworks\fusion211\bin\fusion.cmd" ^
  --StartParams="start" ^
  --StopMode=exe ^
  --StopPath="d:\Lucidworks\fusion211\bin" ^
  --StopImage="d:\Lucidworks\fusion211\bin\fusion.cmd" ^
  --StopParams="stop" ^
  --StopTimeout=60 ^
  --Startup="auto" ^
  --LogPath="d:\Lucidworks\log\windowsservice\fusion"



echo Done 

In a CMD prompt opened as Administrator, execute installWindowsServices.bat


ERRORS in the Windows Service? – Remove them

Create a removeServices.bat file.

rem "confirm" removes the service without showing NSSM's confirmation dialog
nssm.exe remove ApacheSolr confirm
nssm.exe remove ApacheZookeeper confirm
prunsrv //DS//LWFusionService

Make the services log on with an admin account
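
One way to script the logon account, sketched in Python. It assumes NSSM's ObjectName parameter (nssm set <service> ObjectName <user> <password>) is what sets the account a service runs under. The helper names, the account .\svc_solr and the password are hypothetical placeholders, and the Fusion service installed through prunsrv is not covered here. By default the demo only prints the commands, so it can be read and tried without touching a server.

```python
import subprocess


def build_logon_command(service, user, password, nssm="nssm.exe"):
    """Return the argument list that asks NSSM to run `service` as `user`."""
    if not service or not user:
        raise ValueError("service and user are required")
    return [nssm, "set", service, "ObjectName", user, password]


def apply_logon(services, user, password, run=subprocess.check_call):
    """Apply the same logon account to every service in `services`.

    `run` defaults to actually executing the command; pass another
    callable (as the demo below does) for a dry run.
    """
    for service in services:
        run(build_logon_command(service, user, password))


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    apply_logon(
        ["ApacheZookeeper", "ApacheSolr"],
        r".\svc_solr",      # hypothetical account name
        "change-me",        # hypothetical password
        run=lambda argv: print(" ".join(argv)),
    )
```

Dropping the run= argument makes the same call execute nssm.exe for real, which needs an elevated prompt on the Windows server.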

Solr Data Import Handler

When clients tell me they use the Data Import Handler, my first thought is that there is a sadist in a position of perceived power on the team.
That, or all the folks doing the implementation are masochists.

When deciding to use Solr on your website, there are certain questions you will have to ask:

  1. Why is Solr right for us?
  2. Does Solr have hooks and connectors to pull data out of our data storage?
  3. Which one to choose?

These are simple starting points for assessing whether you need to change your search engine.

Frequently at this juncture, folks see that Solr has XML, JSON, binary and database import handlers.
The thinking usually follows this path:

Why is Solr right for us?

  1. Solr is Open Source
  2. It’s free to all, and it’s widely documented
  3. It is Super Stable
  4. It is scalable in any and all directions
  5. It is Blazing Fast!

Does Solr have hooks and connectors to pull data out of our data storage?

  1. XML, JSON, PHP, Java and .NET binary, CSV
  2. Data Import Handler for SQL queries directly out of the database
  3. Hooks to make crawlers push data in through Tika

Which one to choose?

It is safe to assume that you chose the Data Import Handler if you have read this far. Perhaps you only want to see whether you are the sadist or the masochist, or a team of one. Perhaps you don’t want to maintain a proper ETL layer. Whatever your reason for choosing the Data Import Handler, I will always advise that there is a better way of doing things. However, since you are here, let’s talk about all the little ways this choice is going to complicate the maintenance of your product for the next few years, until you replace it with a proper ETL process.

When is the DIH a good choice?

  1. When you have a very tiny and simple denormalized dataset
  2. When said tiny dataset is created from a rather simple SQL query
  3. When the DIH doesn’t have to run very often
  4. When you have somebody on staff to Press The Button – Yes, this is a reference to Lost and its Hatch.

Complications of choosing the DIH

  1. It’s extremely hard to debug.  There is no debugger, and for every tiny change you have to re-run the entire import
  2. If you change the DIH XML files to limit your queries for debugging, it’s easy to forget to change them back
  3. Everything goes well, and then suddenly it’s all really, really slow
  4. When the database is overloaded, the DIH will be equally slow
  5. It is single-threaded
  6. In old versions of Solr, there is only a single commit at the very end. Newer versions let you set up autoCommit in solrconfig.xml, but that is recent behavior
  7. For complex queries and sub-entities, there are rules. Strict rules to follow. Veer from these, or fail to read the wiki with very strict scrutiny, and you are setting yourself up for day-long debugging sessions. You will find yourself asking why a field is getting filled with all kinds of values even though your query works perfectly in the database.
  8. Project scope will be increased by at the very least 6 weeks

In business as in life, when the pros outweigh the cons, the answer is yes. Conversely, if the cons far outnumber the pros and you still choose yes, you have a sadist in power. Or, if you made the choice to do the implementation yourself, you are the masochist.
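
For contrast, here is a minimal sketch of what a small external ETL step could look like in Python. It is an illustration under stated assumptions, not a drop-in replacement: the tables (products, colors), the field names, and the update URL are hypothetical, and an in-memory SQLite database stands in for the real source so the sketch runs anywhere. It assumes Solr's JSON update endpoint accepts a JSON array of documents.

```python
import json
import sqlite3
from urllib import request


def demo_source():
    """Build a tiny in-memory database standing in for the real source."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE products (sku TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE colors (sku TEXT, color TEXT);
        INSERT INTO products VALUES ('A1', 'Red Shoes'), ('B2', 'Plain Hat');
        INSERT INTO colors VALUES ('A1', 'red'), ('A1', 'black');
        """
    )
    return conn


def fetch_docs(conn):
    """Denormalize each product and its colors into one document."""
    docs = {}
    rows = conn.execute(
        "SELECT p.sku, p.name, c.color FROM products p "
        "LEFT JOIN colors c ON c.sku = p.sku ORDER BY p.sku, c.color"
    )
    for sku, name, color in rows:
        doc = docs.setdefault(sku, {"id": sku, "name": name, "color": []})
        if color is not None:
            doc["color"].append(color)  # multi-valued field
    return list(docs.values())


def batches(docs, size):
    """Yield the documents in chunks of at most `size`."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def post_batch(batch, update_url):
    """POST one batch as JSON; `update_url` is a placeholder for your core's /update URL."""
    req = request.Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)


if __name__ == "__main__":
    # Print the batches instead of posting them, so each step can be inspected.
    for batch in batches(fetch_docs(demo_source()), size=1):
        print(json.dumps(batch))
```

Because each step is an ordinary function, you can limit the query, inspect the output, or retry a single batch without re-running the whole import.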

Solr and Magento

Magento is one of those "we hate a lot of it, but we love a lot about it too" kind of products.

It falls squarely in the "we love to use it for our retail, as it makes online retail a breeze" category. It also falls very neatly in the box of "we just hate to have to do dev work for it", since it’s blocks upon blocks upon blocks, and even if you do everything right and have two Magentos side by side, one will do the theme correctly and one just won’t. Why? Who knows.

Things Magento does well

  1. Online retail for non-techie people
  2. Set up polls
  3. Credit card and PayPal integration at the touch of a button
  4. Product setups in Parent-child relationships and inventory control

Things Magento doesn’t do so well

  1. Search
  2. Search as you Type

Why?
Magento uses its EAV (entity-attribute-value) database model, and hence plenty of tables, to maintain products. So when you search for something, it has to go through the arduous process of a MySQL search with a WHERE clause like this:

Search Query = Red Shoes

MySQL Query = SELECT skus FROM <secretTable>
WHERE <column> LIKE '%Red%'
OR <column> LIKE '%Shoes%'

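In SQL land, that is a very expensive query: a pattern with a leading wildcard cannot use an ordinary index on that column, so every row has to be scanned. It also relies very heavily on setting up the data in that column to contain a lot of relevant information.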
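
To make the shape of that query concrete, here is a small Python sketch that builds it from the search terms and runs it against an in-memory SQLite table. The table catalog_search and the column data are hypothetical stand-ins; this does not reproduce Magento's real schema or its actual query.

```python
import sqlite3


def demo_catalog():
    """Create a tiny stand-in for the searchable text column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE catalog_search (sku TEXT, data TEXT)")
    conn.executemany(
        "INSERT INTO catalog_search VALUES (?, ?)",
        [
            ("S1", "red leather shoes"),
            ("S2", "blue running shoes"),
            ("S3", "red scarf"),
            ("S4", "green hat"),
            ("S5", "bored panda poster"),
        ],
    )
    return conn


def like_search(conn, query):
    """Return SKUs whose text contains any of the query terms."""
    terms = query.split()
    if not terms:
        return []
    # One "LIKE ?" per term, joined with OR, as in the query above.
    where = " OR ".join("data LIKE ?" for _ in terms)
    params = ["%" + term + "%" for term in terms]
    sql = "SELECT sku FROM catalog_search WHERE " + where + " ORDER BY sku"
    return [row[0] for row in conn.execute(sql, params)]


if __name__ == "__main__":
    print(like_search(demo_catalog(), "Red Shoes"))
```

Note that the substring match has no notion of words or relevance: the poster in S5 matches "Red" only because "bored" contains those letters, and every match comes back unranked.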

What can we do to make this better?

  1. You have to decide that this search is unbearable.
  2. You will do a lot of research about what other options are available to you
  3. You may or may not talk to a lot of search vendors and come away with varying amounts of confidence in their product’s ability to give you exactly what you want
  4. You will have to decide whether you have the talent and hosting in-house, or need it managed
  5. How about hosting open source technology and having a convenient consulting firm give you the talent on an as needed basis?

Who is out there?
Plug & Play packages, with support packages for We-Rent-Talent:

  1. Innovent Solutions Solr Connect
  2. Innovent Solutions Cloudsearch Connect

Relevance Score calculation in Solr

Lucene uses the TF/IDF scoring algorithm to give initial relevance scores to each document as served up in a search result.

The score is further influenced by how you set up your query parameters and whether or not (and how) you apply boosting techniques. The parameters that influence it are:

&qf=name^2 will give higher scores than &qf=name^0.2
You can keep all your qf boosts within the range [0, 2] to get a pseudo-normalized curve in the qf part of the score influence.
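
As a sketch of how those qf boosts end up in a request, the Python snippet below builds the query string and rejects any boost outside [0, 2]. It assumes the edismax query parser (qf is a dismax/edismax parameter); the field names name and description and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def build_qf(weights, low=0.0, high=2.0):
    """Render {"field": boost} as a qf value, enforcing the [low, high] range."""
    for field, weight in weights.items():
        if not low <= weight <= high:
            raise ValueError(f"boost for {field} is outside [{low}, {high}]")
    return " ".join(f"{field}^{weight:g}" for field, weight in weights.items())


def build_query(q, weights):
    """Return a URL-encoded query string for an edismax request."""
    return urlencode({"q": q, "defType": "edismax", "qf": build_qf(weights)})


if __name__ == "__main__":
    print(build_query("red shoes", {"name": 2, "description": 0.2}))
```

Keeping the range check in one place makes it harder for a stray boost such as name^50 to swamp every other field.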

If you apply boosting with &boost or &bq , then you will further elevate or drop the scores.

You can do &boost=scale( <your formula>, 0, 2) to get the Influence of the boost normalized,
https://wiki.apache.org/solr/FunctionQuery#scale

Or you can simply do &boost=log(<numericField>) to compress large values, so the boost grows slowly and the score stays in a narrower range.
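
To see what these two options do to raw boost values, here is a plain-Python illustration. It is not Solr's implementation: scale below is a simple min-max rescale over a list of values, the handling of identical values is my own choice, and log10_boost assumes Solr's log() is base 10.

```python
import math


def scale(values, target_min, target_max):
    """Linearly map values so the smallest becomes target_min and the largest target_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound (a choice made here).
        return [float(target_min) for _ in values]
    span = target_max - target_min
    return [target_min + (v - lo) * span / (hi - lo) for v in values]


def log10_boost(value):
    """Base-10 log; only meaningful for value > 0, and negative below 1."""
    return math.log10(value)


if __name__ == "__main__":
    popularity = [10, 55, 100]
    print(scale(popularity, 0, 2))
    print([round(log10_boost(v), 3) for v in (10, 1000, 100000)])
```

The rescale pins the boost to a fixed interval, while the log only slows its growth, so a field with extreme outliers may still need the explicit range.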

Everything I showed only influences the final score. That score is first calculated, as Abdullah and Rajesh mentioned, by the TF/IDF scoring algorithm, which combines term frequency, inverse document frequency and much more.
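
A simplified worked example of the TF/IDF part, in Python. It assumes the formulas of Lucene's classic similarity, tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), and deliberately ignores the "much more" (field norms, query normalization, coordination, boosts), so the numbers will not match a real Solr explain output.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of occurrences."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms get a larger factor."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def term_weight(freq, num_docs, doc_freq):
    """Simplified per-term weight: tf * idf, with every other factor left out."""
    return tf(freq) * idf(num_docs, doc_freq)


if __name__ == "__main__":
    num_docs = 1000
    # Same frequency in the document, but one term is rare and one is common.
    print("rare term:  ", round(term_weight(1, num_docs, 9), 3))
    print("common term:", round(term_weight(1, num_docs, 499), 3))
```

The rare term ends up with the larger weight, which is the starting point that qf and boost then push up or down.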