Building a Simple Boolean Search Engine with PHP/MySQL
New Version Released! October 21, 2007
PHP/MySQL Boolean Search Version 2.0 has been downloaded 1811 times since October 21, 2007.
- Fixed: various minor bugs
- Fixed: multipage results not available with quoted strings
- Fixed: full-text query problem
- Fixed: some compatibility issues with different PHP/MySQL configurations.
- Added: sortable results
- Added: configurable field label names
- Added: sample results page
- Added: test mode for troubleshooting.
Download PHP script and sample form. Database not included.
Version 1.5 was downloaded 1954 times between November 18, 2006 and October 21, 2007, and is no longer available for download.
The previous version was downloaded 480 times between May 14 and November 18th of 2006, and has been downloaded 285 times since then.
- More about Boolean Fulltext queries in MySQL
- Simple poll in PHP and MySQL
- Example of this search tool
- Processing Forms with PHP, A Beginner's Guide
When I was first getting started working with PHP and MySQL based websites, I found myself having a great deal of trouble finding an effective means of incorporating a simple PHP-driven search engine. My requirements were pretty simple - it needed to be able to examine the database for search phrases and provide links back to those pages. I finally did find a basic search engine but have since added a number of additional features to it including basic Boolean searching and basic protection against SQL injection attacks. This document runs through some of the basic code and the basics of how it works!
Warning: this script WILL require customization. You'll need at least a moderate handle on PHP variables, arrays, and MySQL database
commands to be able to work with it. The new version is substantially simplified in terms of installation and customization. Although
using it will still certainly require some PHP knowledge, the addition of a configuration file has simplified this substantially.
Search Tool Features
- Boolean support for "OR" and "AND" queries.
- Support for the whole range of MySQL full-text boolean queries.
- Exact phrase searching
(Default.)Exact phrase searching is no longer the default search. The default is now to search individually for the entered words. Phrase searching is requested with "quotes" - Semantically logical, valid and accessible markup for search results and query form.
- Form now offers two default options for results markup: results in a definition list (
<dl>) or in a table (<table>). - Basic database security.
The search engine only searches content contained within your MySQL database. It will not spider your site or locate any document outside your database.
This document was written in reference to the original search script (available here). It has been updated to reflect some of the changes in the new script, but is no longer a complete walk-through of the script.
Getting Started
The first part of the search processing is receiving data from the user. Inevitably, the first thing to be considered any time you're receiving any kind of information from a user is security. It's quite likely that 99% of users will never do anything but enter plain old text queries into your search form - but it only takes one person to cause a lot of trouble. So, we're going to take a pretty close look at the inputted data.
$var = @$_GET['q'];
if (ereg("^[A-z0-9+. -]*[']?[A-z0-9+. -]*$",$var)) {
$trimmed = trim($var);
} else {
$trimmed = "<span class='warning'>Invalid search term entered.</span>";
$testquery = FALSE;
}
The first step is to retrieve the query variable. In many cases it's not recommended to accept variables in the GET method because of security. However, using the POST method, users will not be able to bookmark their searches. For the sake of usability, I'm choosing to use the GET method. However, I'm examining that variable for a number of problems.
We first check the search string. Placing an @ symbol functions to suppress any error message that might arise because of the variable request. It can be helpful to suppress system errors in web applications to avoid giving an attacker additional information about the application. The search string is examined to see that it only contains normal word and number characters, with just a few exceptions. Blank spaces are allowed. Hyphens and periods are allowed. Single quotes are allowed if they appear between two other characters. I've attempted to balance usability with security - so almost any likely search is allowed. You can search for an abbreviated term such as "St. Paul", for contractions such as "you're", or for phone numbers such as "651-314-4353". Are there valid searches that can't be performed using this rule? Probably - please point them out to me if you notice one! One addition I've made in the recent script is to provide the option to permit searches on email addresses, configurable in the configuration file.
The important thing is that potentially dangerous characters such as "#" (used for comments in MySQL statements), "%", or " " are eliminated. These characters could be used to create a malicious attack on the database.
Why didn't I use mysql_real_escape_string?
Well, that's a good question. No, I mean it. I didn't use mysql_real_escape_string because I wanted practice using
regular expressions. There's no good security benefit to what I've done over PHP's built in escape function - and my expression
has the possibility of also eliminating some valid searches. If you use mysql_real_escape_string, you'll need to
check whether magic_quotes_gpc is enabled first in order to determine whether you need to stripslashes.
If you want, you can easily replace this bit of code as follows:
$var = @$_GET['q'];
if ($var) {
$trim1 = trim($var);
if (get_magic_quotes_gpc()) {
$trim1 = stripslashes($trim1);
}
$trimmed = mysql_real_escape_string($trim1);
} else {
$trimmed = "<span class='warning'>Invalid search term entered.</span>";
$testquery = FALSE;
}
Next, the search query is trimmed of excess whitespace. Any blank spaces at the beginning or end of the query string are stripped. Although there is a valid argument for leaving these spaces present (they could be an intended part of a search, for example), it's most likely that users have left extra white spaces by accident.
Last, if the regular expression test fails, the search query is replaced with a warning which will be printed to the screen at a later
point in the code. The variable $testquery is also set to FALSE. This will trigger a later decision tree which will bypass
the regular database query stage.
Summary to date:
- Search string is examined.
- If it passes, whitespace is trimmed and it's placed in the variable
$trimmed. - If it fails, the search query is replaced with a default error message and a Boolean variable is set which will bypass the normal query phase of the script.
Handling the Boolean Phrases
if (ereg(" AND | and | And ",$trimmed,$matches)) {
$burst = $matches[0];
$terms = explode($burst,$trimmed);
$boolean = TRUE;
$split = "AND";
} elseif (ereg(" OR | or | Or ",$trimmed, $matches)) {
$burst = $matches[0];
$terms = explode($burst,$trimmed);
$boolean = TRUE;
$split = "OR";
}
I've skipped forward a little, and we're now inspecting the search query for key Boolean phrases.
I'm keeping this script simple, so I'm only looking for AND and OR. First thing for each phrase is to
identify the presence of that character sequence in the search phrase. But it needs to be more than just
any string "and". A valid string must be a separate word, not contained within another word, so
I'm actually looking for occurrences of the string surrounded by spaces on both sides. The string may also
appear in any combination of upper and lower case - for simplicity's sake, I'm only examining the three most
likely: all capitals, no capitals, and initial capital. Current version actually does look at several of
the other odd choices for capitalization.
Having found a match, I use the explode() function to separate the search phrase into separate
variables, stripping out the Boolean term itself and recording which term was located in the variable
$split.
Handling the MySQL Queries
if (($boolean == TRUE) && ($split == "OR")) {
$query = "SELECT cid AS id, title, category, content
FROM content WHERE
MATCH(title,category,content,author) AGAINST ('$terms[0] $terms[1] $terms[2] $terms[3]' IN BOOLEAN MODE)
ORDER BY category, title";
} elseif (($boolean == TRUE) && ($split == "AND")) {
$query = "SELECT cid AS id, title, category, content
FROM content WHERE
MATCH(title,category,content,author) AGAINST ('+$terms[0] +$terms[1] +$terms[2] +$terms[3]' IN BOOLEAN MODE)
ORDER BY category, title";
} else {
$query = "
(SELECT cid AS id, title, category, content
FROM content WHERE
title LIKE \"%$trimmed%\" OR category LIKE \"%$trimmed%\" OR content LIKE \"%$trimmed%\" OR author LIKE \"%$trimmed%\")
ORDER BY category, title";
}
$sub_query = "
(SELECT cid AS id, title, category
FROM content
WHERE cid LIKE \"invalid\")"
if ($testquery == FALSE) {
$numresults = 0
$numrows = 0
} else {
$numresults=mysql_query($query);
$numrows=mysql_num_rows($numresults);
}
This block of code includes four separate MySQL queries and a selection method which will dictate which query is run for a given query. Read more about Boolean queries in MySQL. In order, the queries are:
- Boolean OR,
- Boolean AND,
- Default, and
- Invalid Query.
The circumstances which trigger each query have already been discussed, but the queries themselves bear a little examination.
The two boolean enabled queries follow the same format with only one significant difference between them. First, the query
specifies which fields it is SELECTing. These are the names of fields in your table which will be provided to the code as variables
to be used later on. As written, the script requires fields called ID, TITLE, CATEGORY and CONTENT. The "cid AS id" syntax
is a means by which you can apply this syntax to the appropriate fields in your preexisting table. In this example, the field CID
in table CONTENT is being relabeled for the purposes of the script as ID.
Although the script still works essentially the same way, much of the above is no longer true: all three positive queries are now made in Boolean mode, the script no longer requires fields called ID, TITLE, CATEGORY or CONTENT (although it does require equivalent fields), and the "cid AS id" syntax is no longer needed. In the current version, the default query (when neither AND nor OR appear in the query) is a Boolean query which allows the user to input Boolean syntax. The queried fields of your MySQL table are set in the configuration file.
The meat of the query is in the MATCH/AGAINST statements. You need to specify the fields which are being searched as a comma
separated string in the MATCH statement, followed by the specification of AGAINST - the terms which are being searched for. The
command phrase IN BOOLEAN MODE informs the query that these terms should be used in Boolean fashion. This is where the difference
between the two queries lies. In the OR statement, the variables $terms[n] have no signifiers, which allows any combination of these
terms to provide a match. In the AND statement, the variables are preceded by a + (plus), indicating that the query must find BOTH
terms in order to provide a match. As this query is written, it allows a maximum of four six separate search terms for the Boolean query (
$terms[0] $terms[1] $terms[2] $terms[3] $terms[4] $terms[5]). It would be very easy to add additional terms - simply include $terms[, etc.
However, I feel it is unlikely that more than four separate terms are likely to be needed.
46]
Finally, the results are sorted - first according to category, then by title. Now, according to the order you set
in the configuration file.
The default query is very similar, except without the Boolean syntax. In this query, the phrase has not been passed through
the explode() function and is just one phrase. This phrase is compared against the same fields using the LIKE
syntax, which provides matches for any record which contains that phrase anywhere in the checked fieldsBoolean search queries.
Finally, the invalid query is created to guarantee no results. It would be possible to simply not run a query for invalid
searches, but I've chosen to provide a query because I can then avoid having to code an entire alternate decision train
for error consequences. This way, the normal "no results" decision path can take over for invalid queries as well.
The query is simply checking for the existence of the word invalidin8valis1d within the field CID. Since this field is the
primary key for the table and contains nothing but integers, this will always fail.
At the end of these four queries, a simple if/then decision is made. If the variable $testquery has been set to FALSE,
then the invalid results query is run instead of the normal query.
Printing out the Results
while ($row = mysql_fetch_array($result)) {
$title = $row["title"];
$id = $row["id"];
$blurb = substr($row["content"],0,100) . ' . . .';
$blurb1 = strip_tags($blurb);
$section = $row["category"];
echo "<tr>
<td class='sr_number'>$count</td>
<td class='sr_page'>
<a href='http://www.yoursite.com/index.php?section=$section&doc=$id&?q=$trimmed'>$title</a>
<p>$blurb1</p>
</td>
<td class='sr_section'>$section</td></tr>";
$count++ ;
}
Another skip forward here, to the next tricky bit - retrieving the results. To provide a usable results page, I'm printing out the count number of the result, a link to the resource using the title of the record as the anchor text, a short blurb from the main content of the record, and a section label. What you will want to provide will, of course, depend on your application of this search. However, the tricky and crucial element is the link to your resource. The structure of this will always vary depending on how your website is retrieving documents.
In this particular example, it's a pretty straightforward use of GET variables. You need to provide the document ID to retrieve
the exact record. The rest of this is pretty optional, depending on your use. If you don't know how to work with the $_GET
array you may find yourself having trouble implementing this search engine!
Conclusion
This search engine can provide an acceptable level of security and functionality to a small website. If you have the knowledge to work with a PHP and MySQL based dynamic site, this could be useful. The use of a full-scale content management system is likely to provide a much stronger level of security and functionality. However, if you like being able to implement each element of your site by yourself then go right ahead!
Download PHP script and sample form. Database not included.
The current version of this script has been downloaded 1954 times since November 20, 2006.
The previous version was downloaded 480 times between May 14 and November 20th of 2006.
For Further Reading:
If you want better information on MySQL's advanced uses and techniques, I recommend buying a book such as High Performance MySQL, by Jeremy Zawodny and Derek Balling or the MySQL Cookbook, by Paul DuBois. Either of these will give you great information - the first book is more about specific methods for improving the performance of your MySQL queries, the second will give you valuable solutions to a wide variety of MySQL programming problems.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs2.5 License.