- Solr Cookbook (Third Edition)
- Rafał Kuć
Incremental imports with DIH
In most use cases, reindexing all of your data from scratch during every indexing run doesn't make sense. Why index 100,000 documents when only 1,000 of them were modified or added? This is where the Solr Data Import Handler delta queries come in handy. Using them, we can index our data incrementally. This recipe will show you how to set up the Data Import Handler to use delta queries and index data in an incremental way.
Getting ready
Refer to the Indexing data from a database using Data Import Handler recipe in this chapter to get to know the basics of the Data Import Handler configuration. I assume that Solr is set up according to the description given in the mentioned recipe.
How to do it...
We will reuse parts of the configuration shown in the Indexing data from a database using Data Import Handler recipe in this chapter, and we will modify it. Execute the following steps:
- The first thing you should do is add an additional column to the tables you use, a column that specifies the last modification date of the record. So, in our case, let's assume that we added a timestamp-based column named last_modified. Now, our db-data-config.xml will look like this:

```xml
<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/users"
              user="users"
              password="secret" />
  <document>
    <entity name="user"
            query="SELECT user_id, user_name FROM users"
            deltaImportQuery="SELECT user_id, user_name FROM users WHERE user_id = '${dih.delta.user_id}'"
            deltaQuery="SELECT user_id FROM users WHERE last_modified > '${dih.last_index_time}'">
      <field column="user_id" name="id" />
      <field column="user_name" name="name" />
      <entity name="user_desc"
              query="SELECT desc FROM users_description WHERE user_id=${user.user_id}">
        <field column="desc" name="description" />
      </entity>
    </entity>
  </document>
</dataConfig>
```
- After this, we run a new kind of query to start the delta import:

```
http://localhost:8983/solr/cookbook/dataimport?command=delta-import
```
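The delta-import command accepts the same optional request parameters as full-import, such as commit and clean. A minimal sketch of building such a request URL follows; the host, port, and core name are the ones assumed throughout this recipe:

```python
from urllib.parse import urlencode

# Base address of the Data Import Handler (core name from this recipe's setup).
base = "http://localhost:8983/solr/cookbook/dataimport"

# delta-import takes the same optional parameters as full-import,
# for example commit (commit after import) and clean (clear the index first).
params = {"command": "delta-import", "commit": "true", "clean": "false"}

url = base + "?" + urlencode(params)
print(url)
```

Sending this URL to Solr starts the incremental import with a commit at the end and without wiping the existing index.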
How it works...
First, we modified our database table to include a column named last_modified. We need to ensure that the column contains the last modification date of the record it corresponds to. Solr will not modify the database, so you have to ensure that your application (or the database itself) does this.
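One way to keep this column up to date without touching application code is to let the database maintain it with a trigger. The following is only a sketch for the PostgreSQL database assumed in this recipe; the function and trigger names are made up for illustration:

```sql
-- Add the timestamp column used by the delta queries (assumed schema).
ALTER TABLE users ADD COLUMN last_modified timestamp NOT NULL DEFAULT now();

-- Hypothetical trigger function that stamps every updated row.
CREATE OR REPLACE FUNCTION touch_last_modified() RETURNS trigger AS $$
BEGIN
  NEW.last_modified := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_touch_last_modified
  BEFORE UPDATE ON users
  FOR EACH ROW EXECUTE PROCEDURE touch_last_modified();
```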
When running a delta import, the Data Import Handler starts by reading a file named dataimport.properties from the Solr configuration directory. If it is not present, the Data Import Handler assumes that no indexing has ever been done. Solr uses this file to store information about the last indexing time, and the file is created or updated after indexing finishes. The last index time is stored as a timestamp. As you can guess, the Data Import Handler uses this timestamp to determine which data has changed since the last import. It can be referenced in a query through a special variable, ${dih.last_index_time}.
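For reference, a dataimport.properties file written by the Data Import Handler typically looks like the following; the timestamps here are only examples:

```properties
#Fri Aug 06 19:35:07 CEST 2021
last_index_time=2021-08-06 19\:35\:07
user.last_index_time=2021-08-06 19\:35\:07
```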
You might already have noticed the two differences: the entity named user now has two additional attributes, deltaQuery and deltaImportQuery. The deltaQuery attribute is responsible for getting information about the users that were modified since the last import. Actually, it only fetches the users' unique identifiers, using the last_modified column we added to determine which users changed since the last import. The deltaImportQuery attribute then fetches each user with the appropriate unique identifier (returned by deltaQuery) to get all the information needed about that user. One thing worth noticing is the way the user identifier is referenced in the deltaImportQuery attribute: ${dih.delta.user_id}. We used the dih.delta variable with its user_id property (which is the same as the table column name) to refer to the user identifier.
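To make this two-phase flow concrete, here is a small Python sketch that emulates what the Data Import Handler does with these two attributes: run deltaQuery to collect the changed identifiers, then substitute each one into deltaImportQuery. The in-memory "table" and the substitution helper are illustrative only, not DIH internals:

```python
# Illustrative stand-in for the users table; not how DIH stores data.
users = [
    {"user_id": 1, "user_name": "alice", "last_modified": "2021-08-05"},
    {"user_id": 2, "user_name": "bob",   "last_modified": "2021-08-01"},
]
last_index_time = "2021-08-03"

# Phase 1 - deltaQuery: collect only the identifiers of changed rows.
changed_ids = [u["user_id"] for u in users
               if u["last_modified"] > last_index_time]

# Phase 2 - deltaImportQuery: re-fetch the full row for each identifier,
# as if '${dih.delta.user_id}' were substituted into the query text.
delta_import_query = ("SELECT user_id, user_name FROM users "
                      "WHERE user_id = '${dih.delta.user_id}'")
queries = [delta_import_query.replace("${dih.delta.user_id}", str(i))
           for i in changed_ids]

print(changed_ids)  # only the row modified after last_index_time
print(queries[0])
```

This is why deltaQuery may return only identifiers: the full field list is always fetched by deltaImportQuery in the second phase.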
You might notice that I left the query attribute in the entity definition. It's left there on purpose; you might need to index the full data once again, so the configuration will be useful for full as well as partial imports.
Next, we have a query that shows how to run the delta import. You might notice that, compared to the full import, we didn't use the full-import command; we sent the delta-import command instead.
The statuses that are returned by Solr are the same as those for the full import, so refer to the appropriate recipes to see what information they carry.
One more thing: delta queries are only supported by the default SqlEntityProcessor. This means that you can use these queries only with JDBC data sources.
See also
- For information about the efficiency of a Data Import Handler, full and delta imports, refer to http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport