Tuesday, December 25, 2012

Solr Functions in Action

The DataImportHandler is great contrib which provides methods to import data into Solr from relational databases. It operates in two modes "full build" and "incremental updates". Delta import calculates changed items then executes query to extract data from the source. It spawns multiple round-trips between Solr and datasource which is often an undesirable behavior. Moreover sometimes it causes "out of memory" exception. So that authors suggest an alternative way by using the same query for both full and delta updates distinguishing them with request and "dataimporter.*" parameters.

<entity name="item" pk="ID"
query="SELECT *
FROM v_index
WHERE '${dataimporter.request.clean}' != 'false'
OR id IN
(SELECT id
FROM t_item i
WHERE ('${dataimporter.request.id}' IS NULL
AND i.updated >= TO_DATE ('${dataimporter.last_index_time}', 'yyyy-mm-dd hh24:mi:ss')))">

What's wrong here? Certainly an XML attribute isn't the best place for writing SQL text. It's better to keep SQL in files (no need to to escape characters, highlighting and so on). Second painful thing is pretty complex query. Such constructions often cause bad execution plans, especially when query runs on complicated views. So there are two options: tune query (it may become overtuned or non-portable soon) or make query simpler.

I've written a few functions. They are very simple in itself but together they produce really great cumulative effect:

  • decode is remake of Oracle's decode
  • load reads query from file
  • run executes statements

Here is rewritten configuration file:

<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<function name="decode" class="solr.f.Decode"/>
<function name="load" class="solr.f.LoadQuery"/>
<function name="run" class="solr.f.RunQuery"/>
<document>
<entity
name="item"
query="${dataimporter.functions.load(
dataimporter.functions.decode(
dataimporter.request.clean,
'false', 'delta.sql',
'full.sql'))}">
<field column="ID" name="id" />
<field column="NAME" name="name" />
<field column="MANU" name="manu" />
<field column="WEIGHT" name="weight" />
<field column="PRICE" name="price" />
<field column="POPULARITY" name="popularity" />
<field column="INSTOCK" name="inStock" />
<field column="INCLUDES" name="includes" />
</entity>
</document>
</dataConfig>
view raw data-config.xml hosted with ❤ by GitHub

This is query for full index. Note what 'before-full.sql' runs before query execution:

/* Run material views updating
${dataimporter.functions.run('before-full.sql')} */
select * from v_index_full
view raw full.sql hosted with ❤ by GitHub
begin
refresh_mviews;
end;
view raw before-full.sql hosted with ❤ by GitHub

Completely different table can be used in delta update:

select *
from
v_index_delta
where id in
(select id from t_item i
where (
'${dataimporter.request.id}' is null
and i.updated >= to_date(
'${dataimporter.last_index_time}', 'yyyy-mm-dd hh24:mi:ss')))
view raw delta.sql hosted with ❤ by GitHub

Please find functions sources.

No comments:

Post a Comment