
Archive for February, 2005

ht://Dig vs mnoGoSearch Comparison

February 14, 2005
This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/209). Since then, other major
players like Lucene and Xapian have come to be considered as open-source
search engines, but they have not yet been included in this comparison.

Introduction

This article is an attempt to objectively compare two wonderful Open Source indexing tools: ht://Dig and mnoGoSearch.

The comparison criteria will mainly focus on what one can do on a Linux web server with Open Source database systems such as PostgreSQL and MySQL, with PHP bindings if available. This is because this research isn’t funded by anyone and I need one of these products for, in this case, the development of a search tool for Dokeos, an e-learning management system developed in PHP+MySQL.

Here, I will intentionally mix the terms search tool and indexing tool, because these products do both, even if the main effort goes into indexing. I know the terms are different, but the distinction doesn’t matter here, and it’s easier for non-technical people to understand search tool than indexing tool.

The indexing system will probably work with the help of command-line parsers. Those parsers generally only work on Linux systems, so the Dokeos system would need to keep this indexing tool as a plugin, so as not to force the user to use a Linux server. This only affects the server anyway, and the Dokeos system can still be used from Windows computers.

For a more extensive list of search tools, you might want to visit searchtools.com which has quite an impressive list… In fact, the pages there for ht://Dig and mnoGoSearch are really interesting.

General project aspect

At first glance, we could say that the website is a reflection of the overall quality of the product. Beyond just deciding whether the website is crappy or well-organised (which reflects the mindset of the developers), you can find documentation, changelogs, and much other interesting material there. Let’s see what we can extract from this part.

Information type | ht://Dig | mnoGoSearch
Design | OK | Bad
Last changelog date | January 2002 | January 2005
Last release date | June 2004 | January 2005
Finding information easily | Yes | Yes
Last documentation update | 2002? | December 2004 (release 3.2.26)
References to “clients” | Around 500 (impressive!) | Around 200 (including MySQL and Debian!)

Another interesting point is the portability of the system. Both systems are only needed on the server, so only the server’s OS matters.

Operating system | ht://Dig | mnoGoSearch
Win32 | via CygWin | native but not free
Linux | yes | yes
Digital Unix | yes | yes
FreeBSD | yes | yes
OpenBSD | no | yes
HP UX | yes | yes
Solaris | yes | yes
Irix | yes | yes
SunOS | yes | yes
Mac OS X | yes | yes
Mac OS 9 | yes | no
BSDI | no | yes
SCO Unixware | no | yes
AIX | no | yes

As for the mailing-lists, it appears to me as if mnoGoSearch were trying to add value to its support contracts by being very slow to answer help messages. The result is pretty poor, as you need around a week to get an answer to pretty much anything, and the documentation is not really good (although reasonably up to date). On the other hand, ht://Dig support seems faster, but it always seems like only one person gives answers, and sometimes no answer comes at all. As far as mailing-list support goes, I would say both projects suck, but mnoGoSearch offers official commercial support, which might be a good thing for companies looking into the product.

As for numbers, over the same 45-day observation period, there were 134 mails on the htdig-general mailing-list, whereas mnogosearch-general comes up to almost 400.

Web and offline “crawling”

Both systems were made to crawl web pages and extract information from them. This is not possible when a website is password-protected, unless you use mnoGoSearch’s support for the HTTP Basic Authentication method, which is not often used to protect websites nowadays.
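For the Basic Authentication case, here is a minimal sketch of how each tool can be fed credentials. The host name, config path and user:password pair are made up, and the exact option and directive spellings should be checked against each tool’s documentation:

# ht://Dig: credentials can be passed to htdig on the command line
# (hypothetical config path and user:password pair)
htdig -i -c /etc/htdig/htdig.conf -u myuser:mypass

# mnoGoSearch: put an AuthBasic line in indexer.conf before the Server line,
# then run the indexer as usual:
#   AuthBasic myuser:mypass
#   Server http://intranet.example.com/
indexer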

To get into a password-protected PHP website, say, you will need to write a (very well protected) PHP page that will log you in and initiate a session for you [1].

While mnoGoSearch offers offline and database crawling, the documentation is lacking here too, and although a small sample ships with the program code, it is not really helpful, because you still need web pages that users can reach with a particular ID to get at the documents you indexed offline. There is also a lack of documentation on how to run multiple offline searches on the same server.
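For the database-crawling part, here is a rough indexer.conf sketch of what the HTDB feature looks like, pieced together from the mnoGoSearch documentation. The connection strings, table and column names are invented for illustration, and the exact SQL placeholders should be double-checked in the HTDB section of the manual:

# where the search index itself is stored
DBAddr    mysql://searchuser:secret@localhost/searchdb/
# where the content to index lives
HTDBAddr  mysql://appuser:secret@localhost/appdb/

# list of records to index, exposed as htdb:/ pseudo-URLs
HTDBList  SELECT id FROM documents

# how to build the "document" indexer sees for a given htdb:/ URL
# ($1 refers to the first path component of that URL)
HTDBDoc   SELECT concat('HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\n', title, ' ', body) FROM documents WHERE id='$1'

Server htdb:/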

The document types

Both search tools were first aimed at indexing HTML documents. The respective websites of these products make that clear in their titles: they are Internet search tools. As far as Dokeos is concerned, this is not enough. Courses are composed of a mix of HTML pages and documents, which can be whatever the user decides. The more document types the search tool can index, the better.

The following table is built as accurately as possible from the documentation of these products at the time of writing. Although the search tools natively index only text and HTML files, they allow the use of parsers (see here for mnoGoSearch and here for ht://Dig) that convert other formats to text or HTML, which can then be indexed; a hedged configuration sketch follows the table.

Document type | ht://Dig | mnoGoSearch
HTML | yes | yes
Plain text | yes | yes
MS-Word | yes, via catdoc | yes, via catdoc
PDF | yes, via xpdf or pdf2text (pdfinfo also extracts meta info) | yes, via pdf2txt
PostScript | yes, via ps2ascii | yes, via ps2ascii
Man pages | probably via parser | yes, via deroff
RPM packages | probably via parser | yes, via rpminfo
MS-Excel | probably via parser | yes, via xls2csv
RTF | probably via parser | yes, via unrtf
MS-PowerPoint | yes, via ppt2html | yes, via ppt2html
Database content | yes, via web frontend :-( | yes, via htdb features
Website needing HTTP context (scripts) | ? | yes
HTTPS | ? | yes
Others… | yes, via appropriate parser | yes, via appropriate parser
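To give an idea of how those parsers are hooked in, here is a hedged configuration sketch. The converter paths are hypothetical, and the directive syntax is reproduced from memory of each tool’s documentation, so verify it before use:

# ht://Dig (htdig.conf): external_parsers maps a MIME type to a converter command
external_parsers: application/msword->text/plain /usr/local/bin/doc2text \
                  application/pdf->text/plain /usr/local/bin/pdf2text

# mnoGoSearch (indexer.conf): the Mime command plays the same role,
# $1 being the temporary file handed to the converter
Mime application/msword "text/plain" "catdoc $1"
Mime application/pdf    "text/plain" "pdftotext $1 -"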

Indexing methods

Indexing can use several kinds of algorithms. Generally speaking, it is the user’s role to choose which algorithm(s) will be used. This is a list of supported algorithm types (a small ht://Dig configuration sketch follows the table):

Algorithm type | ht://Dig | mnoGoSearch
Stemming | yes | yes
Soundex | yes |
Fuzzy | yes |
Synonyms | yes |
Substrings | yes |
Thesaurus based indexation | no | no
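As announced above, here is a small ht://Dig configuration sketch showing how several of these algorithms can be combined; the weights are purely illustrative and the exact algorithm names should be checked in the htsearch documentation:

# htdig.conf excerpt: htsearch mixes several fuzzy algorithms, each with a weight
search_algorithm: exact:1 synonyms:0.5 endings:0.3 soundex:0.2 substring:0.1

As far as I can tell, mnoGoSearch handles stemming through its ispell dictionary configuration rather than through a single weighting directive like this.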

And although they both seem to support boolean search, it might be somewhat limited in the latest versions.

Boolean operator | ht://Dig | mnoGoSearch
AND | yes | yes
OR | yes | yes
NOT | no | yes
GROUP | no | yes
Phrase matching | no |

Language handling

An important matter in this case is the handling of different languages. Dokeos (the company) has its headquarters in Belgium, which is a trilingual country (French, Dutch, German) and uses a lot of English. But the Dokeos product is Open Source and we have heard (and already have a translation to prove it) that Japanese users exist. So the multilingual aspect is important, for the database data as well as for the documents and HTML pages. There is no clear way to build a comparison table for the two products here.

ht://Dig offers multilingual support based on a configuration file and ispell dictionaries. However, this only works for 8-bit character encodings, so Japanese and Chinese cannot be handled (see report).

mnoGoSearch offers multilingual support based on a configuration file, but the indexing of non-English documents is badly documented, so we don’t know what it is based on. However, the recommended encoding for multilingual support is Unicode, and as such it supports all characters. Language detection is supposed to be automatic, based on the words contained in the documents… let us see if we can get more information about this somewhere. According to searchtools.com, the tool uses ispell as a spell checker.

Search interfaces

I will only be able to talk about mnoGoSearch here for now, because I haven’t tested the ht://Dig interface yet.

The mnoGoSearch project offers a bundled CGI search interface as default. This interface is nice and easy, and works with template files so you can modify them at will (almost). But sometimes (as in this case), you need to integrate this search interface into another project, for example a PHP application.

mnoGoSearch also provides a PHP interface that you need to install (not so easily) on your system to be able to use its PHP extension functions directly from your PHP project.

Let’s consider this option in more detail.

  • the install

The first word that comes to my mind here is “brainsquishing”. Why does an install need to have the MySQL and PHP sources and recompile everything? Why isn’t there an easy way to download some scripts, move them to the right place, and start to play? Apparently, this was the case previously and it stopped because integrating the mnoGoSearch functions into a C compiled extension to PHP was more efficient.

Well, for me, someone who just wants a fast, easily integrated solution, it’s just a bad point. Not only do you need to compile these, but you also need to have the MySQL and PHP sources at hand for the compile process. Most users who want to try it on an outsourced server where they don’t have near-admin permissions will be unable to use it…
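For the record, here is roughly what that build dance looks like, as far as I can reconstruct it; the version numbers, paths and configure flags are illustrative and should be checked against the respective INSTALL files:

# 1. build and install mnoGoSearch itself (with MySQL support)
cd mnogosearch-3.2.x
./configure --prefix=/usr/local/mnogosearch --with-mysql
make && make install

# 2. rebuild PHP against that installation to get the udm_* extension functions
cd ../php-4.3.x
./configure --with-mysql --with-mnogosearch=/usr/local/mnogosearch --with-apxs=/usr/bin/apxs
make && make install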

  • the config

I have not gotten past this compilation step yet, so I will write something here when I have.

Other considerations

ht://Dig (or its rundig program) seems to rebuild the index database from scratch each time it runs. It is written in C++.

mnoGoSearch only updates the indexes that are bound to modified sources (but does it work for database sources?). It uses database systems for creating index tables and can handle a dozen different DBMSs. It implements server clustering and mirroring. It can do Basic Authentication if the web server requires it and the settings are configured for it. It supports weighting based on page structure. It supports gzip-compressed content. It is written in C.

Both systems can do search result highlighting.

[1] But mnoGoSearch (what about ht://Dig?) before version 3.2.34 didn’t support cookies, so you had to configure PHP to use trans_sid, which meant the PHPSESSID was given in the URL. But you didn’t want this PHPSESSID to appear in the index database, as users who clicked the link would automatically have been logged in (depending on the session lifetime). So you had to implement something called ReverseAlias to get rid of this URL part before it got into the database. Things then became very hard, as there was a big lack of documentation on how to use this in combination with a list of servers kept in a database table.

rsync

February 3, 2005
This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/205).

rsync is a synchronization tool designed with powerful algorithms to transfer only the minimum needed between source and destination.

Quoting the rsync website: rsync is an open source utility that provides fast incremental file transfer.
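As a minimal usage sketch (the host and paths are made up), a typical mirror run over SSH looks like this:

# copy only the parts of the files that changed, over SSH
rsync -avz -e ssh /var/www/ backup@mirror.example.com:/srv/backup/www/

# same thing as a dry run, to see what would be transferred
rsync -avzn -e ssh /var/www/ backup@mirror.example.com:/srv/backup/www/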

References:

For a basic explanation of the rsync algorithm, go here

Rsyncable gzip

February 3, 2005
This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/206).

GZIP="--rsyncable" tar zcvf toto.tar.gz /toto

Why do you need this special option?

Because if you compress your files before synchronising them with rsync, a very small change in one original file may force rsync to re-transmit the whole compressed tar.gz file, instead of just the changed portion.

The basic reason is that rsync works at the byte level: very roughly, it compares the old copy of the file with the latest source, and transmits every byte that is different to update the old copy and make it identical to the new one. rsync uses a smart way of doing these comparisons, so that in most cases only a tiny portion of the file actually needs to be transmitted.

Unfortunately, file compression algorithms which use an adaptive compression method (as most do) defeat the rsync logic and can cause the whole file to be retransmitted, even if only one byte has changed.

Why is that so?

An adaptive compression method uses an analysis of the bytes already processed to determine how best to compress the following bytes of the file. For example, suppose the compression program starts at byte 0 with a certain compression method. After 1000 bytes have been compressed, the program recalculates a new compression method based on what it found in bytes 0-999. It then inserts a new compression table into the file and uses this table to compress the next 1000 bytes. Then it recalculates its compression table based on bytes 0-1999, and so on. This means that a change of one byte in bytes 0-999 can potentially change the compression method for the rest of the file, so the rest of the output bytes will be totally different. And because rsync compares the files byte by byte, it will not find any similar block of bytes between the old and new file, and will thus be forced to resend the whole new compressed file.

The --rsyncable option above fixes this problem. With this option, gzip regularly “resets” its compression algorithm to what it was at the beginning of the file. So if, for example, there was a change at byte 23, this change will only affect the output up to at most (for example) byte #9999. Then gzip will restart ‘at zero’, and the rest of the compressed output will be the same as it was without the changed byte 23. This means that rsync will now be able to re-synchronise between the old and new compressed file, and can avoid sending the portions of the file that were unmodified.

Now, for the example above, suppose /toto is a directory with plenty of small files totalling 50 MB, so the uncompressed tar file would be about 50 MB. By compressing it with gzip, we bring this down to 15 MB in the tar.gz file. Now we rsync this file to a remote system.

If nothing has changed since yesterday in the /toto directory, the tar.gz file will be the same as yesterday, rsync will detect this and the file will not be transmitted.

On the other hand, if one single small file at the beginning of the tar has changed, then without the --rsyncable option most of the tar.gz file will be different, and rsync will have to transmit almost 15 MB to the remote rsync target system. In that case, it would have been better not to compress the tar file at all!

With the --rsyncable option, it is possible that only 1000 bytes would be different in the tar.gz file, so only 1000 bytes would be transmitted by rsync, for the same end-result.
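Putting it all together, the daily round trip sketched in this example would look roughly like this (the backup host name is hypothetical):

# 1. build the archive with rsync-friendly compression
#    (requires a gzip that carries the --rsyncable patch, e.g. Debian's)
GZIP="--rsyncable" tar zcvf toto.tar.gz /toto

# 2. push it to the backup host; --stats reports how much data actually moved
rsync -av --stats toto.tar.gz backup.example.com:/var/backups/

# 3. after a small change under /toto the next day, repeat both steps:
#    only a small part of the ~15 MB archive should be resent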

References:

For an rsync intro, see here

For a full explanation (and only for Real Programmers), see here

There is also a good summary of the whole rsync/gzip/debian situation here


Bash error message: {bad interpreter: No such file or directory}

February 2, 2005
This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/203).

If you encounter the following error while trying to execute a shell script with Bash, it is probably a carriage return problem: bad interpreter: No such file or directory

To fix it, see HOWTO Convert carriage returns between UNIX and DOS on Debian.
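For illustration, this is roughly what the symptom and the fix look like (the script name is just an example; the ^M is the DOS carriage return ending the shebang line):

$ ./backup.sh
bash: ./backup.sh: /bin/bash^M: bad interpreter: No such file or directory
$ dos2unix backup.sh
$ ./backup.sh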


HOWTO Convert carriage returns between UNIX and DOS on Debian

February 2, 2005
This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/203).

To convert carriage returns between UNIX- and DOS-style line endings, use the tools dos2unix and unix2dos from the sysutils package.

Usage is really simple:

$ dos2unix filename

or

$ unix2dos filename

where filename is the name of the file to convert.

To convert a hierarchy of files starting from current directory:

$ find . -type f -exec dos2unix {} \;

For more information see the manpages.
