Hibernate Search - cool, but is it the right approach? Year baby!

6 minute read

Sanjiv Jivan wrote a blog entry questioning the "point" of Hibernate Search. He missed some critical steps in his argumentation, that I am willing to correct. I started to answer on his blog, but the answer being fairly long, I opted for a blog entry.

I think Sanjiv failed to understand which population Hibernate Search is targeting.
Hibernate Search is about ORM. If you don't use Hibernate, if you don't use JPA, forget about Hibernate Search, it's not for you.

His main point is, why use Hibernate Search instead of a straight Lucene + Database (I'm assuming JDBC) solution? Five years before he could have asked, why use an ORM rather than a straight JDBC access? Because it does for you and optimize 90% of the job and let you focus on the 10% that is hard.
I won't explain why an ORM is usually (but not always) a good approach (everybody got that nowadays), so let's focus on a different question: considering that Hibernate is used in a given application, should we go for plain Lucene and JDBC layer as Sanjiv suggests or should we go for Hibernate Search? Should we go for 2 different set of APIs / programmatic model and model representation, or should we go for one unified model?

Let's see each of Sanjiv's concerns one at a time.

Why Hibernate Search rather than plain Lucene and JDBC?
Out of the box, setting up a plain Lucene and JDBC solution requires to write the bridge. Lucene has it's own world, the DB an other one. Your code has to bind them together (write the optimized JDBC routine + optimized Lucene index routine). It can be long, painful and buggy.
I doubt Sanjiv had to do it before, he would not talk like that :) Hibernate Search does the binding for you in your Hibernate backed application.
People are attracted by Hibernate Search because it lowers the barrier of entry to Lucene in a project by a great deal. This opens the Search capabilities to a lot of applications that would not have considered it with only plain Lucene in their hands.

Hibernate (Search) does not play well with massive indexing
Sanjiv claims that the initial indexing (or reindexing) is slow (he hasn't tried actually) and memory consuming.
Have a second look at the Hibernate Search reference documentation, the massive indexing procedure explicitly helps you to control the amount of memory spent.
In Lucene, one good rule of thumb is use as much memory as possible to minimize IO access. So yes, the more memory you'll spend the more efficient your hibernate Search massive indexing will be. You have to think about the global system, not only a subpart.

Event based indexing should not be used
Next Sanjiv tries to show that the event based indexing is wrong and that one should always use batch indexing. The honest answer is it depends.
Hibernate Search does not constraint to index things per transaction (it's a pluggable strategy), and I never said that indexing at commit time was important. Not indexing before commit time is critical (think about rollbacks).
As a matter of fact, the clustered mode (JMS mode) explicitly does not index at commit time, it delegates the work for later (and to someone else). The overhead of sending a message for later indexing (I'm not speaking of actual Lucene operations here) is minimal.
What do we gain? The usual on the fly vs batch mode benefits: no batch window, more homogeneous CPU consumption on systems, not having to take care of a batch job. I don't know about you, but the less batch jobs I have in my systems, the better I sleep.
By the way, is batch mode supported with Hibernate Search? Absolutely. Who likes to avoid batch jobs when possible, most of the developers and ops guys I have met. When you need to use them, do it ; when you don't stop the masochism.

To justify that batch mode should rules, Sanjiv used the data mining and star / snow schema as an example. These are a very specific kind of applications where ORM are almost never used. They could be, with some adjustments tot he ORM, but that's another story, maybe my next project :) Anyway, this is out of the scope of Hibernate Search, see the very first point.

I agree that JMS is highly over engineered and should be simplified in Java EE6, but come on, setting up a Queue is only a few clicks in a graphical console... it's not too bad. Don't tell me JMS is too hard (Hibernate Search does the JMS calls by the way, not you).

Hibernate Search does not support third party modifications in the database
It's actually a fairly known problem to people who use 2nd level cache in ORMs, has 2nd level cache been banned from our toolbox? clearly no. But once again Hibernate Search works fine in a batch mode. So this should solve Sanjiv's concerns.

Annotation based indexing definition is not flexible
Is that an inflexible approach? How practical would it be to change them on the fly? Changing which elements are indexed, or how would require to reindex the whole set of data. Quite possible, but definitely something that is not so useful on the fly. As for boosting, I do set my field boosting at query time, I find it more flexible than index time boosting, so I never had the issue Sanjiv is describing.

Why using Hibernate Search query API?
Why not using straight Lucene queries an APIs, it's all about text in the end?
The nice thing about the Hibernate Search is that it's really easy to replace a HQL query by a Lucene query: just replace the Query object and you're done, the rest of the code remains unchanged. Because is that simple, people tend to use Hibernate Search and Lucene queries in a more widespread number of usecases, and not simply for a Yahoo-like search screen (we always talk about Google, let's switch for a while ;) ):
- save some DB CPU cycles and distribute it to cheaper machines
- efficient multi word queries
- wildcards
- etc
Here is a use case that is clearly not about plain text:
"increase visibility of all books where 'Paris Hilton' is mentioned and double the increase if 'prison' is also present"

Hibernate Search queries can return either managed objects or projected properties (retrieving only a subset of the data). When to use what?
Sometimes, you use property projections rather than object retrieval in HQL queries either for ease of use or performance reasons, It's more convinient to play with the objects, but you pick up the best tool for the job. I would say the same kind of rules can be applied with Hibernate Search between a regular query and a field projection.

Hibernate Search not suitable for high volume websites
I love this one. I did design high volume websites backed by Lucene. I know what you gain, I know what you lose. Hibernate Search is full of best practices. The Hibernate Search clustering support is a good example of architecture that an architect could mimic to scale with Lucene (up and out). But it's not the only one, it depends on the use case, that's why Hibernate Search does not impose an architecture, that's why I prefer libraries over off-the-shelves products.

I would recommend this off-the-shelves solution?
DBSight or Solr (which I know better) are interesting solutions indeed, but not for the same kind of projects, or at least not for the same integration strategy. We are comparing a library versus a black box. BTW DBSight has a 3-minutes install demo. I could not beat them, it took me 15 mins on stage at JavaOne ( but I walk and talk a lot :) )
I have never been a big fan of black boxes nicely integrated in my IT system, but if I had to choose such a solution I would also give the Google Search Appliance a try, the Google Mini is fairly cheap.

Anyway, Hibernate Search has been developed with practical solutions for practical problems, not theoretical considerations. Giving it a shot is the only way to judge.
Damn long post, sorry about that :(