Discover how to combine an enterprise-worthy search engine — Apache Software Foundation's Solr — with your PHP application.
In "Build a custom search engine with PHP," I combined PHP and the open source Sphinx search engine to create a blazing-fast alternative to text-intensive database queries, such as LIKE and, in the case of MySQL, MATCH. (See Resources for Sphinx-related information.)
Sphinx is easy to install and maintain, and is quite capable.
Moreover, recent releases of Sphinx now provide a native MySQL engine,
eliminating the need to run a separate Sphinx daemon. V0.9.8 (the most
recent release as of this writing) also added geodistance queries, which find records within a given distance of a location, and a feature named multi-query, an optimization that bundles multiple queries and their result sets in a single network connection.
Sphinx continues to improve with time and is ideal for shopping
sites, blogs, and many other applications. According to the Sphinx site,
one application now indexes 700 million documents, or roughly 1.2
terabytes of data. I recommend Sphinx without hesitation.
However, Sphinx does not yet support several features you might like
to employ and offer as your application or site becomes popular and
usage increases. In particular, Sphinx does not yet automatically
replicate or distribute its indices, making its daemon a single point of
failure. (As a workaround, several machines can index the same
database, and you can cluster those systems.) Sphinx does not highlight
search results (like Google does when it displays cached pages), does
not retain or cache recent results, and does not support regular
expression (regex) or date-based operations.
If you seek those features or are ready for an enterprise-grade solution, consider the Apache Software Foundation's Solr project. Based on the Lucene search engine and provided as open source under the terms of the liberal Apache Software License, Solr is (according to the Lucene site) "an open source enterprise search server based on the Lucene Java™ search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a Web administration interface."
Among other notable, highly trafficked Web sites, Netflix, Digg, and
CNET's News.com and CNET Reviews use Solr to power search. A lengthy
list of public Solr-powered sites can be found in the Solr wiki (see
Resources).
Learn how to use Solr and PHP to create a small application to search
a database of automobile parts. While the example database contains
only a handful of records, it could just as easily include millions. All
the source code used in this article is available from the
Download section.
Installing Solr
To combine Solr with PHP, you must install Solr, design an index,
prepare your data to be indexed by Solr, load the index, write PHP code
to execute queries, and present results. Much of the work required to
create a searchable index can be performed from the command line. Of
course, PHP's programmatic interface to Solr can also affect the
contents of an index.
Solr is implemented in Java technology. To run Solr and its administrative tools, you must install a Java V1.5 software development kit (Java 5 SDK). Several vendors provide a Java V1.5 SDK (for example, Sun Microsystems, IBM®, and BEA Systems), and each implementation is capable of powering Solr. Simply choose the Java package suited for your operating system and follow the appropriate instructions to complete the installation.
In many cases, the installation of Java V1.5 is as simple as running a self-extracting archive and accepting the terms of a license agreement. A script in the archive does all the heavy lifting in a matter of seconds. Other operating systems, such as Debian, provide the Java 5 SDK in the APT repository. For example, if you use Debian or Ubuntu, you can install the Java V1.5 software with sudo apt-get install sun-java5-jdk. Conveniently, APT also downloads all the dependencies required to use the Java 5 SDK automatically.
If the Java software is already installed and the Java executable file is in your PATH, run java -version to determine which version of Java you have.
Here, let's use the Mac OS X V10.5 Leopard operating system as the basis of the demonstration. Apple's Leopard includes Java V1.5. With a small change to the default Apache configuration, Leopard runs PHP applications, too. Running java -version in a Leopard terminal window produces the following.
Listing 1. Run java -version in a Leopard terminal window
$ which java
/usr/bin/java
$ java -version
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)
Note: Leopard allows you to switch between Java V1.4
and V1.5 in the Java Preferences application in
/Applications/Utilities/Java. If your installation of Leopard reports
V1.4, open Java Preferences and change the settings to resemble Figure
1.
Figure 1. Java Preferences application in Leopard
To install Solr, visit Apache.org, click Resources > Download, select a convenient project mirror, and navigate within the folders shown to pick a tarball (a .tgz file) of Solr V1.2. The download transfers a file named something akin to apache-solr-1.2.0.tgz. Unpack the tarball with the following code.
Listing 2. Unpack tarball
$ tar xzf apache-solr-1.2.0.tgz
$ ls -F apache-solr-1.2.0
CHANGES.txt NOTICE.txt dist/ lib/
KEYS.txt README.txt docs/ src/
LICENSE.txt build.xml example/
In the newly created directory, the folder named dist contains the Solr code bundled as a Java archive (JAR). The subdirectory example/exampledocs contains examples of data that's formatted, typically as XML code, and ready for Solr to index.
The example directory contains a complete sample Solr application. To run it, simply launch the Java engine with the application archive: start.jar.
Listing 3. Launch Java engine
$ java -jar start.jar
2007-11-10 15:00:16.672::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
2007-11-10 15:00:16.866::INFO: jetty-6.1.3
...
INFO: SolrUpdateServlet.init() done
2007-11-10 15:00:18.694::INFO: Started SocketConnector @ 0.0.0.0:8983
The application is now available on port 8983. Start your browser and type http://localhost:8983/solr/admin/ in the address bar. This is the interface for administering Solr. (To stop the Solr server, use Ctrl+C at the command line.)
But there's no data in the Solr index to manage or query — yet.
Loading data into Solr
Solr is remarkably flexible out of the box, supporting a variety of
data types and rules to create effective indices. If the standard
components do not suffice, you can further customize Solr by writing
new Java classes.
Given a set of data types and rules, you can then create a Solr
schema to describe your data and control how the indices should be
constructed. You then export your data to match the schema and load the
data into Solr. Solr creates the indices on the fly, updating each index
immediately as records are created, modified, or deleted.
The default Solr schema can be found at
Apache.org as part of the Solr source code repository. For reference, a snippet of the default schema is shown below.
Listing 3. Default Solr schema snippet
<schema name="example" version="1.1">
...
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true"/>
<field name="nameSort" type="string" indexed="true" stored="false"/>
<field name="cat" type="text" indexed="true" stored="true" multiValued="true"/>
...
</fields>
<uniqueKey>id</uniqueKey>
...
<copyField source="name" dest="nameSort"/>
...
</schema>
Much of the schema is self-explanatory, but some aspects warrant clarification:
- As shown, the field id is a string (type="string") and should be indexed (indexed="true"). It is also a required field (required="true"). Using this schema, every record loaded in Solr must have a value for this field. The <uniqueKey>id</uniqueKey> modifier further declares that the id field must be unique. (Solr does not require a unique ID field; this is merely a rule established in the default index schema.) The attribute stored="true" indicates that the id field should be retrievable.
  Why would you ever set stored to false? You can use a nonretrievable field to order results differently, as in the case of nameSort, which is a copy of the name field (because of the copyField command on the last line), but has different behaviors. Notice that nameSort is a string, while name is text. The default index schema treats those two types differently: a string is indexed verbatim as a single token, whereas text is analyzed into individual words before indexing, which is why the string copy is the one suited to sorting.
- The field cat is multiValued. A record may define several values for this field. For instance, if your application manages content, a story may be assigned several topics. You could use the cat field (or define a similar field of your own) to capture all the topics.
Listing 4 shows the file example/exampledocs/ipod_other.xml, which represents two entries in a catalog of iPod accessories.
Listing 4. Data formatted for the default Solr index schema
<add>
<doc>
<field name="id">F8V7067-APL-KIT</field>
<field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
<field name="manu">Belkin</field>
<field name="cat">electronics</field>
<field name="cat">connector</field>
<field name="features">car power adapter, white</field>
<field name="weight">4</field>
<field name="price">19.95</field>
<field name="popularity">1</field>
<field name="inStock">false</field>
</doc>
<doc>
<field name="id">IW-02</field>
<field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
<field name="manu">Belkin</field>
<field name="cat">electronics</field>
<field name="cat">connector</field>
<field name="features">car power adapter for iPod, white</field>
<field name="weight">2</field>
<field name="price">11.50</field>
<field name="popularity">1</field>
<field name="inStock">false</field>
</doc>
</add>
The add element is a Solr command to add the enveloped records to the index. Each record is captured in a doc element, which uses a series of named field elements to specify field values. The fields weight, price, inStock, manu, features, and popularity are other fields defined in the default Solr index schema. The features field has identical attributes to cat, but has a different semantic meaning: It enumerates the (potentially many) capabilities of a product.
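The add/doc/field format described above can also be generated from PHP rather than written by hand. The following is a minimal sketch, assuming PHP's bundled DOM extension is available; buildAddXml() is a hypothetical helper name invented for this illustration and is not part of Solr or of any client library.

```php
<?php
// Sketch only: buildAddXml() is a hypothetical helper, not a Solr API.
// It turns an array of records into a Solr <add> message.
function buildAddXml(array $records)
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $add = $dom->appendChild($dom->createElement('add'));

    foreach ($records as $record) {
        $doc = $add->appendChild($dom->createElement('doc'));
        foreach ($record as $name => $values) {
            // A multivalued field is simply repeated: one <field> per value.
            foreach ((array) $values as $value) {
                $field = $dom->createElement('field');
                $field->setAttribute('name', $name);
                // Text nodes are escaped on output, so "&" becomes "&amp;".
                $field->appendChild($dom->createTextNode((string) $value));
                $doc->appendChild($field);
            }
        }
    }
    return $dom->saveXML();
}

$xml = buildAddXml(array(
    array(
        'id'   => 'IW-02',
        'name' => 'iPod & iPod Mini USB 2.0 Cable',
        'cat'  => array('electronics', 'connector'),
    ),
));
echo $xml;
```

Building the message through the DOM keeps special characters such as the ampersand in the product name properly escaped. Note that PHP casts the Boolean false to an empty string, so in a sketch like this, Boolean fields such as inStock are best passed as the strings 'true' and 'false'.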
Searching for auto parts
This example indexes a collection of auto parts. Each auto part has
several fields, with a sample of the most important fields shown in
Table 1. The name of the field is listed in the first column. The second
column provides a brief description, while the third column lists its
logical type. The fourth column shows the index type (as defined in the
schema in
Listing 5) used to represent the datum.
Table 1. The fields of an auto part record
Name | Description | Type | Solr type
Part number (unique, mandatory) | An identifying number | String | partno
Name | A concise description | String | name
Model (required, multi-value) | A model, such as "Camaro" | String | model
Model year (multi-value) | A model year, such as 2001 | String | year
Price | Cost per unit | Float | price
In stock | Within inventory or not | Boolean | inStock
Features | Capabilities of part | String | features
Timestamp | Record of activity | String | timestamp
Weight | Shipping weight | Float | weight
Listing 5 shows a portion of the Solr schema used for the auto parts index. It's largely based on the default Solr schema. The specific fields used (the names and attributes) simply replaced the fields element found in the default schema (shown earlier in the default Solr schema snippet).
Listing 5. The auto parts index schema
<?xml version="1.0" encoding="utf-8" ?>
<schema name="autoparts" version="1.0">
...
<fields>
<field name="partno" type="string" indexed="true"
stored="true" required="true" />
<field name="name" type="text" indexed="true"
stored="true" required="true" />
<field name="model" type="text_ws" indexed="true" stored="true"
multiValued="true" required="true" />
<field name="year" type="text_ws" indexed="true" stored="true"
multiValued="true" omitNorms="true" />
<field name="price" type="sfloat" indexed="true"
stored="true" required="true" />
<field name="inStock" type="boolean" indexed="true"
stored="true" default="false" />
<field name="features" type="text" indexed="true"
stored="true" multiValued="true" />
<field name="timestamp" type="date" indexed="true"
stored="true" default="NOW" multiValued="false" />
<field name="weight" type="sfloat" indexed="true" stored="true" />
</fields>
<uniqueKey>partno</uniqueKey>
<defaultSearchField>name</defaultSearchField>
</schema>
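Because the schema marks partno, name, model, and price as required, it can be useful to check records before sending them to Solr. The following is a minimal sketch of such a pre-check; missingRequiredFields() is a hypothetical helper name, and the list of required fields is copied by hand from the schema above rather than read from Solr.

```php
<?php
// Sketch only: missingRequiredFields() is a hypothetical helper.
// The required-field list is copied by hand from the auto parts schema.
function missingRequiredFields(array $record)
{
    $required = array('partno', 'name', 'model', 'price');
    $missing  = array();

    foreach ($required as $field) {
        // Treat absent, empty-string, and empty-array values as missing.
        if (!isset($record[$field])
            || $record[$field] === ''
            || $record[$field] === array()) {
            $missing[] = $field;
        }
    }
    return $missing;
}

$missing = missingRequiredFields(array('partno' => 3, 'name' => 'Oil filter'));
print_r($missing);
```

A check like this only mirrors the schema; if the schema changes, the hand-copied list must change with it.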
Given the fields above, a database of auto parts exported and formatted for uploading into Solr might look like Listing 6.
Listing 6. A database of auto parts formatted for indexing
<add>
<doc>
<field name="partno">1</field>
<field name="name">Spark plug</field>
<field name="model">Boxster</field>
<field name="model">924</field>
<field name="year">1999</field>
<field name="year">2000</field>
<field name="price">25.00</field>
<field name="inStock">true</field>
</doc>
<doc>
<field name="partno">2</field>
<field name="name">Windshield</field>
<field name="model">911</field>
<field name="year">1991</field>
<field name="year">1999</field>
<field name="price">15.00</field>
<field name="inStock">false</field>
</doc>
</add>
Let's install the new index schema and load the data into Solr. First, stop the Solr daemon (if it's still running) by using Ctrl+C. Make an archive of the existing Solr schema in example/solr/conf/schema.xml. Next, create a text file from Listing 5, save it to /tmp/schema.xml, and copy it to example/solr/conf/schema.xml. Create another file, /tmp/parts.xml, for the data shown in Listing 6. Now, you can start Solr again and use the posting utility provided with the example. Because start.jar keeps running in the foreground, run the posting utility from a second terminal window.
Listing 7. Launching Solr with a new schema
$ cd apache-solr-1.2.0/example
$ cp solr/conf/schema.xml solr/conf/default_schema.xml
$ chmod a-w solr/conf/default_schema.xml
$ vi /tmp/schema.xml
...
$ cp /tmp/schema.xml solr/conf/schema.xml
$ vi /tmp/parts.xml
...
$ java -jar start.jar
...
2007-11-11 16:56:48.279::INFO: Started SocketConnector @ 0.0.0.0:8983
$ java -jar exampledocs/post.jar /tmp/parts.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8,
other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update...
SimplePostTool: POSTing file parts.xml
SimplePostTool: COMMITting Solr index changes...
Success! If you want to verify that the index exists and contains two documents, point your browser again to http://localhost:8983/solr/admin/. You should see "(autoparts)" at the top of the page. If so, click the query box at midpage and type partno: 1 OR partno: 2. (The Boolean operator OR must be uppercase.)
Your result should resemble this:
3 on 10 0 partno: 1 OR partno: 2 2.2
true Boxster 924 Spark plug 1 25.0 2007-11-11T21:58:45.899Z 1999 2000
false 911 Windshield 2 15.0 2007-11-11T21:58:45.953Z 1991 1999
Try some other queries; the query box accepts the Lucene query syntax.
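The same queries can be issued without the admin page by requesting Solr's select handler directly. The sketch below only composes the URL; buildSelectUrl() is a hypothetical helper name, and the base URL assumes the example server started earlier on localhost port 8983.

```php
<?php
// Sketch only: buildSelectUrl() is a hypothetical helper.
// The default base URL assumes the example server on localhost:8983.
function buildSelectUrl($query, $start = 0, $rows = 10,
                        $base = 'http://localhost:8983/solr')
{
    $params = array(
        'q'     => $query,  // the Lucene query string
        'start' => $start,  // offset of the first result
        'rows'  => $rows,   // maximum number of results
        'wt'    => 'json',  // ask for JSON instead of the default XML
    );
    // http_build_query() URL-encodes spaces, colons, and other characters.
    return $base . '/select?' . http_build_query($params, '', '&');
}

$url = buildSelectUrl('partno:1 OR partno:2');
echo $url, "\n";
```

Pasting the resulting URL into a browser while the server is running should return the matching documents.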
You should also try editing and loading the data again. Because the partno field is declared unique, repeated upload operations of the same part number merely replace the old index record with a new record. In addition to the add command, you can use commit, optimize, and delete. The last command can delete a specific record by ID or many records through a query.
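Like add, these commands are small XML messages posted to the update URL that the posting utility reported earlier (http://localhost:8983/solr/update). The sketch below only builds the message bodies; deleteByIdXml() and deleteByQueryXml() are hypothetical helper names invented for this illustration.

```php
<?php
// Sketch only: both helpers are hypothetical names, not part of Solr.
// Each returns the XML body of one delete command.
function deleteByIdXml($id)
{
    // Deletes the single record whose unique key matches $id.
    return '<delete><id>'
        . htmlspecialchars((string) $id, ENT_NOQUOTES, 'UTF-8')
        . '</id></delete>';
}

function deleteByQueryXml($query)
{
    // Deletes every record matched by the query.
    return '<delete><query>'
        . htmlspecialchars($query, ENT_NOQUOTES, 'UTF-8')
        . '</query></delete>';
}

echo deleteByIdXml(2), "\n";            // remove part number 2
echo deleteByQueryXml('model:911'), "\n"; // remove all parts for the 911
echo '<commit/>', "\n";                 // make pending changes searchable
echo '<optimize/>', "\n";               // optimize the index
```

As with additions, deletions are not visible to searches until a commit is issued.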
And now for the PHP
Finally, PHP enters the example.
There are at least two PHP Solr APIs. The most robust implementation is Donovan Jimenez's PHP Solr Client (see
Resources).
The code is licensed under the same terms as Solr, has extensive
documentation, and is compatible with Solr V1.2. The most recent release
as of this writing is dated 2 Oct 2007.
Solr Client provides four PHP classes:
- Apache_Solr_Service represents a Solr server. Use its methods to ping the server, add and delete documents, commit changes, optimize the index, and run queries.
- Apache_Solr_Document embodies a Solr document. The methods of this class manage (key, value) pairs and multivalue fields. Field values can be accessed by direct dereferencing, such as $document->title = 'Something'; ... echo $document->title;.
- Apache_Solr_Response encapsulates a Solr response. This code depends on the json_decode() function, which is bundled with PHP V5.2.0 and later or can be installed with the PHP Extension Community Library (PECL; see Resources).
- Apache_Solr_Service_Balancer enhances Apache_Solr_Service, allowing you to connect to multiple Solr services in a distribution. This class is not covered here.
Download the PHP Solr Client (see Resources) and extract it to a working directory. Change to the SolrPhpClient directory. Next, check the file Apache/Solr/Service.php. At the time of this writing, line 335 was missing a trailing semicolon. Edit the file, and add the semicolon, if necessary. Also, check the file Apache/Solr/Document.php. Lines 112-117 should read as follows.
if (!is_array($this->_fields[$key]))
{
$this->_fields[$key] = array($this->_fields[$key]);
}
$this->_fields[$key][] = $value;
After you correct the files, you can install the
Apache directory alongside your other PHP libraries.
The code below shows a PHP application that connects to a Solr service,
adds two documents to the index, and runs the part number query used
previously.
Listing 8. A sample PHP application to connect to, load, and query a Solr index
<?php
require_once( 'Apache/Solr/Service.php' );

//
// Try to connect to the named server, port, and url
//
$solr = new Apache_Solr_Service( 'localhost', '8983', '/solr' );
if ( ! $solr->ping() ) {
    echo 'Solr service not responding.';
    exit;
}

//
// Create two documents to represent two auto parts.
// In practice, documents would likely be assembled from a
// database query.
//
$parts = array(
    'spark_plug' => array(
        'partno'  => 1,
        'name'    => 'Spark plug',
        'model'   => array( 'Boxster', '924' ),
        'year'    => array( 1999, 2000 ),
        'price'   => 25.00,
        'inStock' => true,
    ),
    'windshield' => array(
        'partno'  => 2,
        'name'    => 'Windshield',
        'model'   => '911',
        'year'    => array( 1991, 1999 ),  // same years as Listing 6
        'price'   => 15.00,
        'inStock' => false,
    )
);

$documents = array();
foreach ( $parts as $item => $fields ) {
    $part = new Apache_Solr_Document();
    foreach ( $fields as $key => $value ) {
        if ( is_array( $value ) ) {
            // Multivalue field: add each value in turn
            foreach ( $value as $datum ) {
                $part->setMultiValue( $key, $datum );
            }
        }
        else {
            $part->$key = $value;
        }
    }
    $documents[] = $part;
}

//
// Load the documents into the index
//
try {
    $solr->addDocuments( $documents );
    $solr->commit();
    $solr->optimize();
}
catch ( Exception $e ) {
    echo $e->getMessage();
}

//
// Run some queries. Provide the query, a starting offset
// for result documents, and the maximum number of result
// documents to return. You can also use a fourth parameter
// to control how results are sorted and highlighted,
// among other options.
//
$offset = 0;
$limit = 10;
$queries = array(
    'partno: 1 OR partno: 2',
    'model: Boxster',
    'name: plug'
);

foreach ( $queries as $query ) {
    $response = $solr->search( $query, $offset, $limit );
    if ( $response->getHttpStatus() == 200 ) {
        // print_r( $response->getRawResponse() );
        if ( $response->response->numFound > 0 ) {
            echo "$query <br />";
            foreach ( $response->response->docs as $doc ) {
                echo "$doc->partno $doc->name <br />";
            }
            echo '<br />';
        }
    }
    else {
        echo $response->getHttpStatusMessage();
    }
}
?>
To begin, the code connects to the named Solr server on the port and path given, and uses the ping() method to verify that the server is operational.
Next, the code translates the records represented as PHP arrays into Solr documents. If a field has a single value, a simple accessor adds the (key, value) pair to the document. If a field has multiple values, each value in the list is added to the key with the special method setMultiValue(). You can see that this process closely resembles the XML representation of a Solr document.
As an optimization, addDocuments() inserts multiple documents into the index. Subsequent commit() and optimize() calls finalize the additions.
At the bottom, several queries retrieve data from the index. You can view the results through two lenses: The getRawResponse() method yields the entire, unparsed result, while the response->docs property holds an array of documents with named accessors.
If a query does not get the OK from Solr, the code prints an error message. An empty result set emits no output.
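To see what the response class does with the raw result, it can help to decode a JSON response by hand with json_decode(). The following is a minimal sketch; the sample string is hand-written for illustration in the general shape of a Solr JSON response and was not captured from a running server.

```php
<?php
// Sketch only: $json is a hand-written sample, not real server output.
$json = '{"responseHeader":{"status":0,"QTime":3},'
      . '"response":{"numFound":2,"start":0,"docs":['
      . '{"partno":"1","name":"Spark plug"},'
      . '{"partno":"2","name":"Windshield"}]}}';

// Without a second argument, json_decode() returns nested objects,
// which is why fields read as $data->response->docs.
$data = json_decode($json);
if ($data === null) {
    exit('Response is not valid JSON.');
}

// Collect one "partno name" line per matching document.
$lines = array();
foreach ($data->response->docs as $doc) {
    $lines[] = $doc->partno . ' ' . $doc->name;
}
echo implode("\n", $lines), "\n";
```

The loop mirrors the one in Listing 8: each decoded document exposes its stored fields as properties.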
More power
Solr is incredibly powerful, and the PHP API makes integration on any
platform a snap. Better yet, Solr is easy to set up and operate, and
you can enable advanced features as you need them. Best of all, Solr is
free. Don't pay for a search engine. Save your greenbacks and go Solr.
Surf the Solr Web site to learn more about advanced configuration,
including sorting, categorized results, and replication. The Lucene Web
site is another source of information because it's the search technology
beneath the Solr system.
Download
Source: http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html