Cataloging SQL Data and Almost Anything Else
This How-To describes specifically how to get SQL data into a ZCatalog for searching, but the principles here can be used to put just about anything into the catalog. From: http://zope.org/Members/rbickers/cataloganything
Overview
This How-To describes specifically how to get SQL data into a ZCatalog for searching, but the principles here can be used to put just about anything into the catalog.
Disclaimer: This information is based on my experience using ZCatalog with SQL and other external data. It is not based on deep knowledge of the inner workings of the catalog, so this may or may not be making use of the catalog as it was designed to be used.
ZCatalog Basics (Plus)
There are some basic things about ZCatalog that you should understand before trying this procedure. If are not already very familiar with how ZCatalog works, you should read Chapter 9 of the Zope Book before you continue.
To summarize what you need to know, a ZCatalog (hereinafter
referred to as just the catalog) stores 2 main types of information for
each object in the catalog: indexes and meta data. The indexes are used
during the search to find objects based on search parameters, and the
meta data is made available as the results of a matching object. For
example, if I have an index PrincipiaSearchSource that contains the text of a document object, I can search document text by using that index. If I have a meta data title, I will be able to display the title of the object when it's returned in the list of search results.
The catalog stores information from an object by going through each defined index and meta data name. If the object either has a property or a method with the same name, that information is taken from the object and stored in the catalog. When searching on the catalog, a special result object is returned for each match, not the object that was cataloged (since it is never actually stored in the catalog). These are two very important things to understand and remember.
One discovery I made (maybe it's documented somewhere other than
the source code, I don't know), is that the unique ID under which an
object is cataloged is used to construct the URL returned by the result
object's getURL() method. This comes in very handy when
cataloging things that are not in the object database. You can catalog
an object that contains the appropriate index and meta data
information, but use your own uid to construct the URL
you want used when the object shows up in the search results. This
should make more sense as you follow the examples below.
Setting Up the Catalog
The first step in any search is to set up the catalog. There is no difference in setting up a catalog for "normal" searches and setting one up for a search of SQL data, so I'm not going to go into detail here on how to do that. Just make sure you have created a catalog with the desired indexes and meta data.
Cataloging SQL Data
There are two main steps in cataloging the SQL data: 1) setup a
ZSQL Method that returns the fields you wish to store in indexes and/or
meta data, and 2) create a script that calls catalog_object() on each record returned by the ZSQL Method. These two steps are described in detail below.
Creating a ZSQL Method
Before you can catalog SQL data, you must be able to tell the catalog what data you want it to store. You do this by creating a ZSQL Method that returns records whose field names correspond to the indexes and/or meta data names in the catalog.
For example, the SQL table Books contains the following fields:
- SQL Fields:
Number, Title, Author, Description, Price
The catalog contains the following indexes and meta data:
- Indexes:
PrincipiaSearchSource, bobobase_modification_time, id, meta_type, title - Meta data:
bobobase_modification_time, id, meta_type, summary, title
If you want the PrincipiaSearchSource index to contain the Description, Title, and Author SQL fields, the summary meta data to contain the Description, and the book Number to be the id, you could use the following ZSQL Method (named getBooksToCatalog):
SELECT Number, Title,
concat(Description, Title, Author) as PrincipiaSearchSource,
'Book' as meta_type, Number as id, Description as summary
FROM Books
You now have a ZSQL Method that will return a list of all of the
books using field names that match the indexes and meta data of the
catalog. I use Book as the meta_type so that I can easily
know in the search results that the object refers to a book. It's a
nice feature for me, but it isn't necessary for this to work. This
illustrates the point that you can provide static information to be
stored in the catalog by including it in your query with the desired
meta data name.
Creating a Script to Catalog the Objects
Now that you have a list of records you want to store, you need to
create a script that will iterate over the list and add the information
to the catalog. This is easily done with a Python Script. This script
is created in the catalog object itself. If you place it anywhere else
(it really doesn't matter where), you will need to modify it so that it
can find the catalog. Since this script is in the catalog, you can just
use the bound variable container to access it. The following script (named catalogBooks) will do the job::
for book in container.getBooksToCatalog():
container.catalog_object(book,
'/Publications/getBook/+book.Number+/bookdetails.html')
print Book # + book.Number
return printed
This iterates over the records returned by the ZSQL Method created above, and calls catalog_object()
on each. This is where most of the magic happens. Each book number that
is cataloged will be printed, one per line...nothing fancy. You don't
need the print statements, but it gives some reassuring feedback.
catalog_object() takes two parameters: 1) the object to be cataloged, and 2) a unique ID (uid). The object is a record from the ZSQL Method, and as mentioned above, the uid is used to form the URL that will be returned by the result object's getURL() method. The use of the uid will be explained in detail in the next section.
All you need to do now is execute the script and the data will be
put in the catalog. You can run the script either through the Test
tab in the management interface, or by entering its URL in your
browser. Either way will work the same. You can browse through the
catalog to verify that the indexes and meta data have been filled in as
expected.
The Result Object's URL
A catalog search returns a list of matching result records. As with
creating the catalog, there is no difference in how you search a
catalog with SQL data in it. The difference is in the result object's
URL. The URL as returned by getURL() does not reference an actual object in the object database. Instead, the uid used to catalog the SQL data will cause getURL() to return URLs like http://wherever/Publications/getBook/00123/bookdetails.html. This URL refers to a ZSQL Method getBook and a DTML Method bookdetails.html,
that when used together will display the information on that book. This
is actually a feature of Zope and ZSQL Methods. It has nothing to do
with the catalog, but it's included here to illustrate the use of the uid.
The folder Publications contains the methods for displaying the information on a book:
getBookhas the single argumentnumberand has the advanced setting Allow "simple" direct traversal checked. The SQL query is as follows:
SELECT * FROM Books WHERE
bookdetails.htmlhasdtml-varstatements that display the fields of the result. For example (a very plain example):<dtml-var name="standard_html_header">
<TABLE><TR>
<TD>Book Number</TD><TD><dtml-var Number></TD>
<TD>Description</TD><TD><dtml-var Description></TD>
</TR></TABLE>
<dtml-var name="standard_html_footer">
If you provide links using getURL() on the results page of a search, you can use whatever uid suits you for creating URLs that will displaying the SQL data.
Cataloging Just About Anything
Using a process similar to that above, you can catalog just about
anything. As demonstrated above, it doesn't actually have to be an
object in the Zope object database. It's a virtual object. If the
object you pass to catalog_object() has properties or
methods with the same name as the indexes and/or meta data of your
catalog, you can add it to the catalog. Note that it is not necessary
to provide all index and meta data information in object to be
cataloged. If an object doesn't have a particular property or method
with the same name as an index/meta data, it is ignored and left blank
in the catalog. The key to allowing the user to view the virtual object
is in the uid used when cataloging the information. You just need to construct a uid (and thus a valid URL) that will display the object.
If this doesn't make sense, the following example will hopefully help.
Cataloging Virtual Objects Example
What Is Already There
In this example, there are a set of articles in PDF format that can
be viewed online. Each article is listed in the SQL table Articles with the following fields:
- SQL Fields:
Number, Title, Author, Description, Filename
The first four fields should be self explanatory. The Filename
field contains the filename of the PDF document associated with the
article. The PDF documents are stored on the file system in a directory
accessible via a URL such as http://wherever/pdfdocs/filename.pdf. They
are not stored in the Zope object database.
The site contains the ZSQL Method getArticle and the DTML Method articledetails.html
that show the article information stored in the SQL table (similar to
what was used in the book example above). This "preview" of the article
also includes a link to the PDF document so that it can be downloaded
or viewed through a browser plug-in.
getArticle is as follows:
SELECT * FROM Articles WHERE <dtml-sqltest number column="Number" type=nb>
A very plain version of articledetails.html might is as follows:
<dtml-var name="standard_html_header">
<TABLE><TR>
<TD>Article Number</TD><TD><dtml-var Number></TD>
<TD>Description</TD><TD><dtml-var Description></TD>
<TD>PDF</TD><TD><A HREF="/pdfdocs/&dtml-Filename;">View/Download</A></TD>
</TR></TABLE>
<dtml-var name="standard_html_footer">
Setting Up the Search
The goal is to be able to search the text of the article, but have the results direct the user to the article display page above where they can then view the PDF document.
First, create a catalog that contains the following indexes and meta data:
- Indexes:
PrincipiaSearchSource, bobobase_modification_time, id, meta_type, title - Meta data:
bobobase_modification_time, id, meta_type, summary, title
To get a list of the articles to catalog, start with the following ZSQL Method 'getArticlesToCatalog':
SELECT Number, Title, 'Article' as meta_type, Filename
FROM Books
This gives you a list of the articles, but in order to search the
text of the PDF document, the text needs to be stored in the PrincipiaSearchSource
index. Since the document source is not in the database, you can't just
return it using the ZSQL Method as you did with the book descriptions
above.
This where it gets exciting. You need to get the records from the
ZSQL Method results to return the text of the PDF document as the PrincipiaSearchSource.
To do this, you need to use an advanced feature of ZSQL Methods to
assign the record to a class. But first you must create the class.
Creating the ZSQL Record Class
Since Zope cannot read PDF documents, and I don't know of any Python modules that will do it, you must first convert the PDF documents into corresponding text files so that Zope can read the article text. There are a number of tools you can use to do this very quickly and with little effort.
Next, create a file sqlrecord.py in the Extensions directory of your Zope installation that contains the following source code::
from string import split
class Article: """Class used by ZSQL for indexing articles"""
def id(self): """Use article file as id"""
return split(self.Filename, .)[0]
def PrincipiaSearchSource(self): """Read article text"""
# Get the .txt version of the .pdf filename
basename = split(self.Filename, '.)[0]
filename = /path-to-docs/pdfdocs/+basename+.txt'
try:
fp = open(filename, r)
except IOError:
return ''
text = fp.read() fp.close() return text
You will need to modify the path information to point to the directory where the converted .txt files are.
Now go to the Advanced tab of the getArticlesToCatalog ZSQL Method and enter Article as the Class Name and sqlrecord as the Class File. The Article class will now provide the methods id() and PrincipiaSearchSource() that can be used by the ZSQL Method to provide information to the catalog.
The final step is to create the Python Script the iterates over each ZSQL record and passes it to catalog_object(). The following script (named catalogArticles) will do the job. As with the book example, if you do not place this script in the catalog object itself, you must modify container to be your catalog:
for article in container.getArticlesToCatalog():
container.catalog_object(article,
'/Publications/getArticle/'+article.Number+'/articledetails.html')
print 'Article #' + article.Number
return printed
When you run this script, Zope will use getArticlesToCatalog,
along with the associated class methods, to store the article
information and searchable text in the catalog. As you can see from the
uid in the catalogArticles script, the search
result object will provide URLs that point to the "preview" of the
article described above, where the user can then select to view the PDF
article. You could also provide a uid that points directly to the PDF file if you wanted to skip the preview.
Summary
You can see by example that the possibilities of what can be stored in the catalog are great. Although both examples used SQL data, there's no reason you can't use an object that gets the index and meta data information from the Zope object database, other external files, remote data on other servers, or any combination of the above. If you can create an object that gathers the appropriate information, you can stick it in the catalog.
Questions/Comments
I hope this How-To has been helpful. If you have any questions, comments, or information to improve this process, please let me know.
