
[vps-mail] Issues re Spamassassin & Bayes



Scott, I have had the following exchange on the spamassassin
mailing list regarding issues I have had with Bayes.
Because it could affect all VPS users, I am setting forth
everything in full. Basically, the issue appears to be in
the way the VPS handles the database required for SA/Bayes.

I've figured out how to keep things working on my VPS, by
the way - my own hack seems to work, but this may not hold
true with an upgrade to version 2.6.

Note: I determined the version of Berkeley DB by running the
command:
% file ~/.spamassassin/bayes_toks*

Results:  Berkeley DB 1.85 (Hash, version 2, native byte-order)
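
The same check can be sketched without file(1) by reading the header
bytes directly. This is only a rough sketch: it assumes the magic numbers
and offsets that file(1)'s magic database uses for Berkeley DB (hash magic
0x061561, btree magic 0x053162, stored at offset 0 in 1.85/1.86 files and
at offset 12 in DB 2.x and later). The function names are illustrative
and not part of any library.

```python
import struct

# Assumed values, taken from the magic database used by file(1).
HASH_MAGIC = 0x00061561
BTREE_MAGIC = 0x00053162


def describe_bdb_header(header: bytes) -> str:
    """Classify a Berkeley DB file from its first 20 bytes."""
    if len(header) < 20:
        return "too short to be a Berkeley DB file"
    for order, label in (("<", "little-endian"), (">", "big-endian")):
        # Old 1.85/1.86 files: magic at offset 0, version at offset 4.
        magic0, version0 = struct.unpack_from(order + "II", header, 0)
        # DB 2.x and later: magic at offset 12, version at offset 16.
        magic12, version12 = struct.unpack_from(order + "II", header, 12)
        for magic, name in ((HASH_MAGIC, "hash"), (BTREE_MAGIC, "btree")):
            if magic0 == magic:
                return f"old-style 1.85/1.86 {name}, version {version0}, {label}"
            if magic12 == magic:
                return f"DB 2.x or later {name}, version {version12}, {label}"
    return "not recognised as Berkeley DB"


def describe_bdb_file(path: str) -> str:
    """Read the first 20 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return describe_bdb_header(fh.read(20))
```

Calling describe_bdb_file() on the expanded path of
~/.spamassassin/bayes_toks should report an old-style hash file if the
database really is in the 1.85 format.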


1) My post:

From: Abigail Marshall <abigail@xxxxx>
Bayes configuration questions  
2003-09-24 20:19

 Here's the issue:
 
 System: Running SA 2.54, FreeBSD Unix, Berkeley DB 1.85
 (Hash, version 2):
 
 Problem: When bayes_toks grows to more than 5K, it becomes
 corrupted during sa-learn and ultimately trashed or lost.
 
 My solution: Set bayes_expiry_max_db_size to a lower level to
 force expiry, so that bayes_toks doesn't grow too large.
 
 I did not change bayes_expiry_min_db_size or
 bayes_expiry_scan_count.
 
 Questions/Problems:
 
 1. Why did bayes_toks grow to more than 5k in the first
 place?
 
 Documentation for SA 2.5x (sa-learn.html) says:
 
 > Once it hits 5000 bytes, the bayes_toks database is
 > locked, and the message counter entry in that database is
 > increased accordingly.
 
 2. What is the default setting for
 bayes_expiry_max_db_size in SA 2.5x, and how large should
 the resulting file be?
 
 Through experimentation, I have ended up with a setting:
 
 bayes_expiry_max_db_size        150000
 
 With this setting, bayes_toks never gets any larger than
 2,556 kb.
 
 According to the documentation for SA 2.6 (sa-learn.txt):
 
 > "bayes_expiry_max_db_size" specifies both the auto-expire token count
 >  point, as well as the resulting number of tokens after expiry as
 >  described above. The default value is 150,000, which is roughly
 >  equivalent to a 6Mb database file if you're using DB_File.
 
 Note that my setting is the SAME as the default for 2.6 -
 but rather than a 6Mb db file, I end up with a 2.5 Mb file,
 with a Bayes corpus of ~2800 or less.
 
 Documentation for SA 2.5 (sa-learn.html) says:
 
 > bayes_expiry_min_db_size is part of the SpamAssassin
 > configuration. The default value is 100000, which is
 > roughly equivalent to a 5Mb database file if you're using
 > DB_File.
 
 So here is where I am totally confused:  from what I can
 tell, my setting of bayes_expiry_max_db_size=150000 should
 either have no effect whatsoever, or it should leave me with
 a bayes_toks file that will grow to 5K - and I end up with a
 file half that size.
 
 NOTE: Bayes works fine for me this way, but my guess is that
 with the small corpus and short expiry cycle I may see
 erratic performance over time. Because my system seems to
 get much more spam than ham, I end up with a 9:2
 ratio of spam to ham as autolearn continues to feed incoming
 email to the database.
 
 But my main concern right now is trying to figure out why
 my experience doesn't match what the documentation says I
 should expect.
 
 -Abigail
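
The size figures quoted in the post above can be compared with a little
arithmetic. The sketch below assumes that the documentation's "roughly
equivalent" ratios (100,000 tokens for 5Mb, 150,000 tokens for 6Mb) can
be read as a flat bytes-per-token cost, and that "Mb" and "kb" mean
binary megabytes and kilobytes. Neither assumption is confirmed by the
thread, and a hash database does not grow strictly linearly, so the
result is only a ballpark. The function names are illustrative.

```python
MIB = 1024 * 1024  # assumed meaning of "Mb" in the quoted documentation


def bytes_per_token(db_megabytes: float, tokens: int) -> float:
    """Rough per-token cost implied by a documented size/token pair."""
    return db_megabytes * MIB / tokens


def estimate_tokens(file_kb: float, per_token: float) -> int:
    """Rough token count for a bayes_toks file of the given size in kb."""
    return int(file_kb * 1024 / per_token)


# Ratios implied by the documentation quoted in the post.
ratio_sa25 = bytes_per_token(5, 100_000)  # SA 2.5: 100,000 tokens ~ 5Mb
ratio_sa26 = bytes_per_token(6, 150_000)  # SA 2.6: 150,000 tokens ~ 6Mb

# Observed file size from the post: 2,556 kb.
for label, ratio in (("SA 2.5 ratio", ratio_sa25), ("SA 2.6 ratio", ratio_sa26)):
    tokens = estimate_tokens(2556, ratio)
    print(f"{label}: about {ratio:.1f} bytes/token, about {tokens} tokens")
```

Under these assumptions a 2,556 kb file would hold somewhere around
50,000 to 62,000 tokens, well below a 150,000-token limit, so the file
size alone does not show that the expiry setting is what keeps it small.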
 
2) Response #1

From: David B Funk <dbfunk@xxxxx>
Re: Bayes configuration questions  
2003-09-24 21:59

 On Wed, 24 Sep 2003, Abigail Marshall wrote:
 
 > Here's the issue:
 >
 > System: Running SA 2.54, FreeBSD Unix, Berkeley DB 1.85
 > (Hash, version 2):
 [snip..]
 > -Abigail
 
 Abigail,
 Where the heck did you manage to find V1.85 of the Berkeley DB kit?
 It's ancient and buggy ( I was using it a decade ago with sendmail v6 ;)
 
 Get a reasonably modern version of Berkeley DB, say at least v3.3
 or better.
 
 Berkeley DB should have -no- trouble handling Bayes databases in the
 multi-megabyte size range with hundreds of thousands of tokens.
 Heck, my bayes_journal file usually runs about 50k bytes (of course
 that's with 10k-20k messages per day, so it does a 0-100k
 roll-over every 10 minutes).
 
 -- 
 Dave Funk                                  University of Iowa
 <dbfunk (at) engineering.uiowa.edu>        College of Engineering
 319/335-5751   FAX: 319/384-0549           1256 Seamans Center
 Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
 #include <std_disclaimer.h>
 Better is not better, 'standard' is better. B{
 
 
3) Response #2:

Email Archive: spamassassin-talk (read-only)

From: <jm@xxxxx> (Justin Mason)
Re: Bayes configuration questions  
2003-09-24 23:13

 David B Funk writes:
 >On Wed, 24 Sep 2003, Abigail Marshall wrote:
 >
 >> Here's the issue:
 >>
 >> System: Running SA 2.54, FreeBSD Unix, Berkeley DB 1.85
 >> (Hash, version 2):
 >[snip..]
 >> -Abigail
 >
 >Abigail,
 >Where the heck did you manage to find V1.85 of the Berkeley DB kit?
 >It's ancient and buggy ( I was using it a decade ago with sendmail v6 ;)
 
 Aha -- that explains it.  David's right, Berkeley DB 1.85 is
 very old, and has had problems noted before (in other products,
 at least).  IMO that's the most likely cause of the problem...
 
 PS: I was also thinking disk quotas, but this is much more likely.
 
 --j.

======================================================================
Technical questions regarding this list may be sent to
<vps-mail-owner@xxxxxxxxxxxx>. You may request an automated help
response by sending an email with the word 'help' (w/o quotes) in the
BODY of the message (subject is ignored) to <vps-mail-request@xxxxxxxxxxxx>.
======================================================================

