In "Build a custom search engine with PHP," I combined PHP and the open source Sphinx search engine to create a blazing-fast alternative to text-intensive database queries, such as `LIKE` and, in the case of MySQL, `MATCH`. (See Resources for Sphinx-related information.)

Sphinx is easy to install and maintain, and is quite capable. Moreover, recent releases of Sphinx provide a native MySQL engine, eliminating the need to run a separate Sphinx daemon. V0.9.8 (the most recent release as of this writing) also added geodistance queries, which find records within a given distance of a location, and a feature named multi-query, an optimization that bundles multiple queries and sets of results in a single network connection.
Sphinx continues to improve with time and is ideal for shopping sites, blogs, and many other applications. According to the Sphinx site, one application now indexes 700 million documents, or roughly 1.2 terabytes of data. I recommend Sphinx without hesitation.
However, Sphinx does not yet support several features you might like to employ and offer as your application or site becomes popular and usage increases. In particular, Sphinx does not yet automatically replicate or distribute its indices, making its daemon a single point of failure. (As a workaround, several machines can index the same database, and you can cluster those systems.) Sphinx does not highlight search results (like Google does when it displays cached pages), does not retain or cache recent results, and does not support regular expression (regex) or date-based operations.
If you seek those features or are ready for an enterprise-grade solution, consider the Apache Software Foundation's Solr project. Based on the Lucene search engine and provided as open source under the terms of the liberal Apache Software License, Solr is (according to the Lucene site) "an open source enterprise search server based on the Lucene Java™ search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a Web administration interface."
Among other notable, highly trafficked Web sites, Netflix, Digg, and CNET's News.com and CNET Reviews use Solr to power search. A lengthy list of public Solr-powered sites can be found in the Solr wiki (see Resources).
Learn how to use Solr and PHP to create a small application to search a database of automobile parts. While the example database contains only a handful of records, it could just as easily include millions. All the source code used in this article is available from the Download section.
Installing Solr
To combine Solr with PHP, you must install Solr, design an index, prepare your data to be indexed by Solr, load the index, write PHP code to execute queries, and present results. Much of the work required to create a searchable index can be performed from the command line. Of course, PHP's programmatic interface to Solr can also affect the contents of an index.

Solr is implemented in Java technology. To run Solr and its administrative tools, you must install a Java V1.5 software development kit (Java 5 SDK). Several vendors provide a Java V1.5 SDK (for example, Sun Microsystems, IBM®, and BEA Systems), and each implementation is capable of powering Solr. Simply choose the Java package suited for your operating system and follow the appropriate instructions to complete the installation.
In many cases, the installation of Java V1.5 is as simple as running a self-extracting archive and accepting the terms of a license agreement. A script in the archive does all the heavy lifting in a matter of seconds. Other operating systems, such as Debian, provide the Java 5 SDK in the APT repository. For example, if you use Debian or Ubuntu, you can install the Java V1.5 software with `sudo apt-get install sun-java5-jdk`. Conveniently, APT also automatically downloads all the dependencies required to use the Java 5 SDK.
If the Java software is already installed and the Java executable file is in your `PATH`, run `java -version` to determine which version of Java you have.

Here, let's use the Mac OS X V10.5 Leopard operating system as the basis of the demonstration. Apple's Leopard includes Java V1.5. With a small change to the default Apache configuration, Leopard runs PHP applications, too. Running `java -version` in a Leopard terminal window produces the following.
Listing 1. Run `java -version` in a Leopard terminal window

    $ which java
    /usr/bin/java
    $ java -version
    java version "1.5.0_13"
    Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
    Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)

Figure 1. Java Preferences application in Leopard
To install Solr, visit Apache.org, click Resources > Download, select a convenient project mirror, and navigate within the folders shown to pick a tarball (a .tgz file) of Solr V1.2. The download transfers a file named something akin to apache-solr-1.2.0.tgz. Unpack the tarball with the following code.

Listing 2. Unpack tarball

    $ tar xzf apache-solr-1.2.0.tgz
    $ ls -F apache-solr-1.2.0
    CHANGES.txt  NOTICE.txt  dist/     lib/
    KEYS.txt     README.txt  docs/     src/
    LICENSE.txt  build.xml   example/

The example directory contains a complete sample Solr application. To run it, change to that directory and launch the Java engine with the application archive: start.jar.

Listing 3. Launch Java engine

    $ java -jar start.jar
    2007-11-10 15:00:16.672::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
    2007-11-10 15:00:16.866::INFO: jetty-6.1.3
    ...
    INFO: SolrUpdateServlet.init() done
    2007-11-10 15:00:18.694::INFO: Started SocketConnector @ 0.0.0.0:8983

Once the server has started, open a browser and enter http://localhost:8983/solr/admin/ in the address bar. This is the interface for administering Solr. (To stop the Solr server, use Ctrl+C at the command line.)

But there's no data in the Solr index to manage or query yet.
Loading data into Solr
Solr is remarkably flexible out of the box, supporting a variety of data types and rules to create effective indices. The standard components are broad, but if they do not suffice, you can further customize Solr by writing new Java classes.

Given a set of data types and rules, you can then create a Solr schema to describe your data and control how the indices should be constructed. You then export your data to match the schema and load the data into Solr. Solr creates the indices on the fly, updating each index immediately as records are created, modified, or deleted.
The default Solr schema can be found at Apache.org as part of the Solr source code repository. For reference, a snippet of the default schema is shown below.
Listing 3. Default Solr schema snippet

    <schema name="example" version="1.1">
      ...
      <fields>
        <field name="id" type="string" indexed="true" stored="true" required="true" />
        <field name="name" type="text" indexed="true" stored="true"/>
        <field name="nameSort" type="string" indexed="true" stored="false"/>
        <field name="cat" type="text" indexed="true" stored="true" multiValued="true"/>
        ...
      </fields>
      <uniqueKey>id</uniqueKey>
      ...
      <copyField source="name" dest="nameSort"/>
      ...
    </schema>

- As shown, the field `id` is a string (`type="string"`) and should be indexed (`indexed="true"`). It is also a required field (`required="true"`). Using this schema, every record loaded in Solr must have a value for this field. The `<uniqueKey>id</uniqueKey>` modifier further declares that the `id` field must be unique. (Solr does not require a unique ID field; this is merely a rule established in the default index schema.) The attribute `stored="true"` indicates that the `id` field should be retrievable. Why would you ever set `stored` to `false`? You can use a nonretrievable field to order results differently, as in the case of `nameSort`, which is a copy of the `name` field (because of the `copyField` command near the end of the snippet) but has different behaviors. Notice that `nameSort` is a `string`, while `name` is `text`. The default index schema treats those two types differently: a `string` is indexed verbatim as a single token, while `text` is tokenized and analyzed, which makes the untokenized copy the better choice for sorting.
- The field `cat` is `multiValued`. A record may define several values for this field. For instance, if your application manages content, a story may be assigned several topics. You could use the `cat` field (or define a similar field of your own) to capture all the topics.
Listing 4. Data formatted for the default Solr index schema

    <add>
      <doc>
        <field name="id">F8V7067-APL-KIT</field>
        <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
        <field name="manu">Belkin</field>
        <field name="cat">electronics</field>
        <field name="cat">connector</field>
        <field name="features">car power adapter, white</field>
        <field name="weight">4</field>
        <field name="price">19.95</field>
        <field name="popularity">1</field>
        <field name="inStock">false</field>
      </doc>
      <doc>
        <field name="id">IW-02</field>
        <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
        <field name="manu">Belkin</field>
        <field name="cat">electronics</field>
        <field name="cat">connector</field>
        <field name="features">car power adapter for iPod, white</field>
        <field name="weight">2</field>
        <field name="price">11.50</field>
        <field name="popularity">1</field>
        <field name="inStock">false</field>
      </doc>
    </add>

The `add` element is a Solr command to add the enveloped records to the index. Each record is captured in a `doc` element, which uses a series of named `field` elements to specify field values. The fields `weight`, `price`, `inStock`, `manu`, `features`, and `popularity` are other fields defined in the default Solr index schema. The `features` field has identical attributes to `cat`, but has a different semantic meaning: It enumerates the (potentially many) capabilities of a product.

Searching for auto parts
This example indexes a collection of auto parts. Each auto part has several fields, with a sample of the most important fields shown in Table 1. The name of the field is listed in the first column. The second column provides a brief description, while the third column lists its logical type. The fourth column shows the index field (as defined in the schema in Listing 5) used to represent the datum.

Table 1. The fields of an auto part record

Name | Description | Type | Solr field |
---|---|---|---|
Part number (unique, mandatory) | An identifying number | String | partno |
Name | A concise description | String | name |
Model (required, multi-value) | A model, such as "Camaro" | String | model |
Model year (multi-value) | A model year, such as 2001 | String | year |
Price | Cost per unit | Float | price |
In stock | Within inventory or not | Boolean | inStock |
Features | Capabilities of part | String | features |
Timestamp | Record of activity | String | timestamp |
Weight | Shipping weight | Float | weight |

The auto parts schema in Listing 5 defines its own set of fields in place of the `fields` element found in the default schema (shown in the default Solr schema snippet earlier).

Listing 5. The auto parts index schema

    <?xml version="1.0" encoding="utf-8" ?>
    <schema name="autoparts" version="1.0">
      ...
      <fields>
        <field name="partno" type="string" indexed="true" stored="true" required="true" />
        <field name="name" type="text" indexed="true" stored="true" required="true" />
        <field name="model" type="text_ws" indexed="true" stored="true" multiValued="true" required="true" />
        <field name="year" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true" />
        <field name="price" type="sfloat" indexed="true" stored="true" required="true" />
        <field name="inStock" type="boolean" indexed="true" stored="true" default="false" />
        <field name="features" type="text" indexed="true" stored="true" multiValued="true" />
        <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false" />
        <field name="weight" type="sfloat" indexed="true" stored="true" />
      </fields>
      <uniqueKey>partno</uniqueKey>
      <defaultSearchField>name</defaultSearchField>
    </schema>

Listing 6. A database of auto parts formatted for indexing

    <add>
      <doc>
        <field name="partno">1</field>
        <field name="name">Spark plug</field>
        <field name="model">Boxster</field>
        <field name="model">924</field>
        <field name="year">1999</field>
        <field name="year">2000</field>
        <field name="price">25.00</field>
        <field name="inStock">true</field>
      </doc>
      <doc>
        <field name="partno">2</field>
        <field name="name">Windshield</field>
        <field name="model">911</field>
        <field name="year">1991</field>
        <field name="year">1999</field>
        <field name="price">15.00</field>
        <field name="inStock">false</field>
      </doc>
    </add>

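In practice, a file like the one in Listing 6 would rarely be typed by hand. The following is a minimal sketch of how such an `<add>` message might be generated from PHP arrays. The function name `buildAddXml()` is hypothetical (it is not part of Solr or of any client library), and the sketch assumes every value can be cast to a string. Prices are kept as strings so that trailing zeros survive.

```php
<?php
// Hypothetical helper: turn an array of records into a Solr <add> message.
// Each record maps a field name to a scalar or to an array of values.
function buildAddXml(array $records)
{
    $lines = array('<add>');
    foreach ($records as $record) {
        $lines[] = '  <doc>';
        foreach ($record as $name => $value) {
            // An array becomes one <field> element per value (a multi-valued field).
            foreach ((array) $value as $datum) {
                if (is_bool($datum)) {
                    $datum = $datum ? 'true' : 'false';
                }
                // Escape &, <, >, and quotes so the output stays well-formed XML.
                $lines[] = sprintf(
                    '    <field name="%s">%s</field>',
                    htmlspecialchars((string) $name, ENT_QUOTES, 'UTF-8'),
                    htmlspecialchars((string) $datum, ENT_QUOTES, 'UTF-8')
                );
            }
        }
        $lines[] = '  </doc>';
    }
    $lines[] = '</add>';
    return implode("\n", $lines) . "\n";
}

// Sample record taken from Listing 6.
$parts = array(
    array(
        'partno'  => '1',
        'name'    => 'Spark plug',
        'model'   => array('Boxster', '924'),
        'year'    => array('1999', '2000'),
        'price'   => '25.00',
        'inStock' => true,
    ),
);

echo buildAddXml($parts);
```

Escaping matters here: a name such as "iPod & iPod Mini" must be emitted with `&amp;`, or the resulting document is not valid XML.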
Listing 7. Launching Solr with a new schema

    $ cd apache-solr-1.2.0/example
    $ cp solr/conf/schema.xml solr/conf/default_schema.xml
    $ chmod a-w solr/conf/default_schema.xml
    $ vi /tmp/schema.xml
    ...
    $ cp /tmp/schema.xml solr/conf/schema.xml
    $ vi /tmp/parts.xml
    ...
    $ java -jar start.jar
    ...
    2007-11-11 16:56:48.279::INFO: Started SocketConnector @ 0.0.0.0:8983
    $ java -jar exampledocs/post.jar /tmp/parts.xml
    SimplePostTool: version 1.2
    SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
    SimplePostTool: POSTing files to http://localhost:8983/solr/update...
    SimplePostTool: POSTing file parts.xml
    SimplePostTool: COMMITting Solr index changes...

To test the new index, run the query `partno: 1 OR partno: 2` (for example, from the administration interface). Your result should resemble this:

    3 on 10 0 partno: 1 OR partno: 2 2.2
    true Boxster 924 Spark plug 1 25.0 2007-11-11T21:58:45.899Z 1999 2000
    false 911 Windshield 2 15.0 2007-11-11T21:58:45.953Z 1991 1999

You should also try editing and loading the data again. Because the `partno` field is declared unique, repeated upload operations of the same part number merely replace the old index record with a new record. In addition to the `add` command, you can use `commit`, `optimize`, and `delete`. The last command can delete a specific record by ID or many records through a query.

And now for the PHP
Finally, PHP enters the example.

There are at least two PHP Solr APIs. The most robust implementation is Donovan Jimenez's PHP Solr Client (see Resources). The code is licensed under the same terms as Solr, has extensive documentation, and is compatible with Solr V1.2. The most recent release as of this writing is dated 2 Oct 2007.
Solr Client provides four PHP classes:

- `Apache_Solr_Service` represents a Solr server. Use its methods to ping the server, add and delete documents, commit changes, optimize the index, and run queries.
- `Apache_Solr_Document` embodies a Solr document. The methods of this class manage (key, value) pairs and multivalue fields. Field values can be accessed by direct dereferencing, such as `$document->title = 'Something'; ... echo $document->title;`.
- `Apache_Solr_Response` encapsulates a Solr response. This code depends on the `json_decode()` function, which is bundled with PHP V5.2.0 and later or can be installed with the PHP Extension Community Library (PECL; see Resources).
- `Apache_Solr_Service_Balancer` enhances `Apache_Solr_Service`, allowing you to connect to multiple Solr services in a distribution. This class is not covered here.

    if (!is_array($this->_fields[$key])) {
        $this->_fields[$key] = array($this->_fields[$key]);
    }

    $this->_fields[$key][] = $value;

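The fragment above promotes a single-valued field to an array before appending a new value. To see that behavior in isolation, here is a minimal standalone sketch. `SketchDocument` is a hypothetical stand-in written for illustration, not the real `Apache_Solr_Document` class, and the extra `isset()` guard for fields that have no value yet is an assumption of this sketch.

```php
<?php
// Hypothetical stand-in for a Solr document class; for illustration only.
class SketchDocument
{
    private $_fields = array();

    // Append a value to a field, promoting an existing single value to an array.
    public function setMultiValue($key, $value)
    {
        if (!isset($this->_fields[$key])) {
            // Assumption of this sketch: an unset field starts as an empty list.
            $this->_fields[$key] = array();
        } elseif (!is_array($this->_fields[$key])) {
            $this->_fields[$key] = array($this->_fields[$key]);
        }

        $this->_fields[$key][] = $value;
    }

    // Magic accessors provide the direct-dereferencing style: $doc->name = '...'.
    public function __set($key, $value)
    {
        $this->_fields[$key] = $value;
    }

    public function __get($key)
    {
        return isset($this->_fields[$key]) ? $this->_fields[$key] : null;
    }
}

$doc = new SketchDocument();
$doc->name  = 'Spark plug';
$doc->model = 'Boxster';              // stored as a single value
$doc->setMultiValue('model', '924');  // now array('Boxster', '924')
$doc->setMultiValue('year', 1999);    // unset field becomes array(1999)

print_r($doc->model);
```

Keeping single values as scalars until a second value arrives means simple fields can still be read back directly, while multi-valued fields come back as arrays.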
The code below shows a PHP application that connects to a Solr service, adds two documents to the index, and runs the part number query used previously.
Listing 8. A sample PHP application to connect to, load, and query a Solr index

    <?php
    require_once( 'Apache/Solr/Service.php' );

    //
    // Try to connect to the named server, port, and url
    //
    $solr = new Apache_Solr_Service( 'localhost', '8983', '/solr' );

    if ( ! $solr->ping() ) {
      echo 'Solr service not responding.';
      exit;
    }

    //
    // Create two documents to represent two auto parts.
    // In practice, documents would likely be assembled from a
    // database query.
    //
    $parts = array(
      'spark_plug' => array(
        'partno' => 1,
        'name' => 'Spark plug',
        'model' => array( 'Boxster', '924' ),
        'year' => array( 1999, 2000 ),
        'price' => 25.00,
        'inStock' => true,
      ),
      'windshield' => array(
        'partno' => 2,
        'name' => 'Windshield',
        'model' => '911',
        'year' => array( 1999, 2000 ),
        'price' => 15.00,
        'inStock' => false,
      )
    );

    $documents = array();

    foreach ( $parts as $item => $fields ) {
      $part = new Apache_Solr_Document();

      foreach ( $fields as $key => $value ) {
        if ( is_array( $value ) ) {
          foreach ( $value as $datum ) {
            $part->setMultiValue( $key, $datum );
          }
        }
        else {
          $part->$key = $value;
        }
      }

      $documents[] = $part;
    }

    //
    // Load the documents into the index
    //
    try {
      $solr->addDocuments( $documents );
      $solr->commit();
      $solr->optimize();
    }
    catch ( Exception $e ) {
      echo $e->getMessage();
    }

    //
    // Run some queries. Provide the raw path, a starting offset
    // for result documents, and the maximum number of result
    // documents to return. You can also use a fourth parameter
    // to control how results are sorted and highlighted,
    // among other options.
    //
    $offset = 0;
    $limit = 10;

    $queries = array(
      'partno: 1 OR partno: 2',
      'model: Boxster',
      'name: plug'
    );

    foreach ( $queries as $query ) {
      $response = $solr->search( $query, $offset, $limit );

      if ( $response->getHttpStatus() == 200 ) {
        // print_r( $response->getRawResponse() );

        if ( $response->response->numFound > 0 ) {
          echo "$query <br />";

          foreach ( $response->response->docs as $doc ) {
            echo "$doc->partno $doc->name <br />";
          }

          echo '<br />';
        }
      }
      else {
        echo $response->getHttpStatusMessage();
      }
    }
    ?>

First, the code connects to the Solr server and calls the `ping()` method to verify that the server is operational.

Next, the code translates the records represented as PHP arrays into Solr documents. If a field has a single value, a simple accessor adds the (key, value) pair to the document. If a field has multiple values, each value in the list is added to the key with the special function `setMultiValue()`. You can see that this process closely resembles the XML representation of a Solr document.

As an optimization, `addDocuments()` inserts multiple documents into the index in one call. The subsequent `commit()` and `optimize()` calls finalize the additions.

At the bottom, several queries retrieve data from the index. You can view the results through two lenses: The `getRawResponse()` function yields the entire, unparsed result, while the `docs` property of the parsed response provides an array of documents with named accessors.

If a query does not get the OK from Solr, the code prints an error message. An empty result set emits no output.
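Behind a search call sits an ordinary HTTP request and a JSON document. The sketch below illustrates that round trip without contacting a server. It assumes Solr's standard `select` handler and its `q`, `start`, `rows`, and `wt` request parameters; the helper names `buildSelectUrl()` and `extractDocs()` are hypothetical, and the sample response is written by hand in the shape of a Solr JSON response rather than captured from a running server.

```php
<?php
// Hypothetical helper: build the URL for a Solr select request.
function buildSelectUrl($host, $port, $path, $query, $offset, $limit)
{
    // http_build_query() URL-encodes each value; '&' is passed explicitly
    // so the separator does not depend on the local PHP configuration.
    $params = http_build_query(array(
        'q'     => $query,
        'start' => $offset,
        'rows'  => $limit,
        'wt'    => 'json',
    ), '', '&');

    return sprintf('http://%s:%s%s/select?%s', $host, $port, $path, $params);
}

// Hypothetical helper: pull the list of documents out of a JSON response body.
function extractDocs($json)
{
    $data = json_decode($json);

    if ($data === null || !isset($data->response->docs)) {
        return array();   // malformed or unexpected response
    }

    return $data->response->docs;
}

echo buildSelectUrl('localhost', 8983, '/solr', 'name: plug', 0, 10), "\n";

// Hand-written sample in the shape of a Solr JSON response (not real server output).
$sample = '{"response":{"numFound":1,"start":0,'
        . '"docs":[{"partno":"1","name":"Spark plug"}]}}';

foreach (extractDocs($sample) as $doc) {
    echo "$doc->partno $doc->name\n";
}
```

Because `json_decode()` returns objects by default, each document exposes its fields as properties, which is the same access style the listing uses with `$doc->partno` and `$doc->name`.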
More power
Solr is incredibly powerful, and the PHP API makes integration on any platform a snap. Better yet, Solr is easy to set up and operate, and you can enable advanced features as you need them. Best of all, Solr is free. Don't pay for a search engine. Save your greenbacks and go Solr.

Surf the Solr Web site to learn more about advanced configuration, including sorting, categorized results, and replication. The Lucene Web site is another source of information because it's the search technology beneath the Solr system.
Download
Description | Name | Size |
---|---|---|
Sample PHP and Solr application | os-php-apachesolr.src.zip | 109KB |
Source: http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html