A bit more detail.
With these settings:
Code:
ok_locales en
ok_languages uk ru en
normalize_charset 0
inactive_languages af am ar be bg bs ca cs cy da de el eo es et eu fa fi fr fy ga gd he hi hr hu hy id is it ja ka ko la lt lv mr ms ne nl no pl pt qu rm ro sa sco sk sl sq sr sv sw ta th tl tr vi yi zh zh.big5 zh.gb2312
textcat_max_languages 5
testing gives the following:
Code:
store tmp # spamassassin -D < test_utf8.msg
...
Фев 18 12:18:58.164 [19562] dbg: plugin: Mail::SpamAssassin::Plugin::TextCat=HASH(0xa12b6a8) implements 'extract_metadata', priority 0
Фев 18 12:18:58.164 [19562] dbg: message: ---- MIME PARSER START ----
Фев 18 12:18:58.164 [19562] dbg: message: parsing multipart, got boundary: _----------=_1298017287235347
Фев 18 12:18:58.165 [19562] dbg: message: found part of type text/plain, boundary: _----------=_1298017287235347
Фев 18 12:18:58.165 [19562] dbg: message: added part, type: text/plain
Фев 18 12:18:58.166 [19562] dbg: message: found part of type text/html, boundary: _----------=_1298017287235347
Фев 18 12:18:58.166 [19562] dbg: message: added part, type: text/html
Фев 18 12:18:58.167 [19562] dbg: message: parsing normal part
Фев 18 12:18:58.167 [19562] dbg: message: parsing normal part
Фев 18 12:18:58.167 [19562] dbg: message: ---- MIME PARSER END ----
Фев 18 12:18:58.167 [19562] dbg: message: decoding other encoding type (binary), ignoring
Фев 18 12:18:58.168 [19562] dbg: message: decoding other encoding type (binary), ignoring
Фев 18 12:18:58.181 [19562] dbg: textcat: classifying, skipping: pt ne hi tr sco es da zh.gb2312 sw no lv fr ro vi sa ta id th sr et tl cy ko fi lt hr de be cs yi af bs is qu sl la ga hy ms am ja eu ka mr bg sv zh.big5 rm it he zh hu sq pl eo ca fa fy nl ar gd sk el
Фев 18 12:18:58.257 [19562] dbg: textcat: language possibly: ru.iso-8859-5,uk.koi8-r,ru.koi8-r,ru.windows-1251,en
Фев 18 12:18:58.258 [19562] dbg: textcat: X-Languages: "ru.iso-8859-5 uk.koi8-r ru.koi8-r ru.windows-1251 en", X-Languages-Length: 10000
Фев 18 12:18:58.258 [19562] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x9cb6210) implements 'parsed_metadata', priority 0
...
Фев 18 12:19:02.275 [19562] dbg: rules: ran body rule LOCAL_TEST ======> got hit: "test"
Фев 18 12:19:07.827 [19562] dbg: rules: ran body rule LOCAL_UTF ======> got hit: "порно"
Фев 18 12:19:08.062 [19562] dbg: rules: ran body rule MY_PORNO2 ======> got hit: "порно"
Фев 18 12:19:08.936 [19562] dbg: rules: ran body rule LOCAL_ALL ======> got hit: "порно"
...
Code:
store tmp # spamassassin -D < test_cp1251.msg
...
Фев 18 12:23:43.262 [19613] dbg: plugin: Mail::SpamAssassin::Plugin::TextCat=HASH(0xa589ca8) implements 'extract_metadata', priority 0
Фев 18 12:23:43.262 [19613] dbg: message: ---- MIME PARSER START ----
Фев 18 12:23:43.262 [19613] dbg: message: parsing multipart, got boundary: _----------=_1298017287235347
Фев 18 12:23:43.263 [19613] dbg: message: found part of type text/plain, boundary: _----------=_1298017287235347
Фев 18 12:23:43.264 [19613] dbg: message: added part, type: text/plain
Фев 18 12:23:43.265 [19613] dbg: message: found part of type text/html, boundary: _----------=_1298017287235347
Фев 18 12:23:43.265 [19613] dbg: message: added part, type: text/html
Фев 18 12:23:43.265 [19613] dbg: message: parsing normal part
Фев 18 12:23:43.265 [19613] dbg: message: parsing normal part
Фев 18 12:23:43.265 [19613] dbg: message: ---- MIME PARSER END ----
Фев 18 12:23:43.265 [19613] dbg: message: decoding other encoding type (binary), ignoring
Фев 18 12:23:43.266 [19613] dbg: message: decoding other encoding type (binary), ignoring
Фев 18 12:23:43.278 [19613] dbg: textcat: classifying, skipping: pt ne hi tr sco es da zh.gb2312 sw no lv fr ro vi sa ta id th sr et tl cy ko fi lt hr de be cs yi af bs is qu sl la ga hy ms am ja eu ka mr bg sv zh.big5 rm it he zh hu sq pl eo ca fa fy nl ar gd sk el
Фев 18 12:23:43.346 [19613] dbg: textcat: language possibly: ru.windows-1251
Фев 18 12:23:43.346 [19613] dbg: textcat: X-Languages: "ru.windows-1251", X-Languages-Length: 8331
Фев 18 12:23:43.346 [19613] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0xa114818) implements 'parsed_metadata', priority 0
...
Фев 18 12:23:46.037 [19613] dbg: rules: ran body rule LOCAL_TEST ======> got hit: "test"
Фев 18 12:23:49.214 [19613] dbg: rules: ran body rule LOCAL_WIN ======> got hit: "?????"
...
With normalize_charset enabled:
Code:
store tmp # spamassassin -D < test_utf8.msg
...
Фев 18 12:26:53.804 [19722] dbg: plugin: Mail::SpamAssassin::Plugin::TextCat=HASH(0x9a466a8) implements 'extract_metadata', priority 0
Фев 18 12:26:53.804 [19722] dbg: message: ---- MIME PARSER START ----
Фев 18 12:26:53.804 [19722] dbg: message: parsing multipart, got boundary: _----------=_1298017287235347
Фев 18 12:26:53.805 [19722] dbg: message: found part of type text/plain, boundary: _----------=_1298017287235347
Фев 18 12:26:53.805 [19722] dbg: message: added part, type: text/plain
Фев 18 12:26:53.806 [19722] dbg: message: found part of type text/html, boundary: _----------=_1298017287235347
Фев 18 12:26:53.806 [19722] dbg: message: added part, type: text/html
Фев 18 12:26:53.807 [19722] dbg: message: parsing normal part
Фев 18 12:26:53.807 [19722] dbg: message: parsing normal part
Фев 18 12:26:53.807 [19722] dbg: message: ---- MIME PARSER END ----
Фев 18 12:26:53.807 [19722] dbg: message: decoding other encoding type (binary), ignoring
Фев 18 12:26:53.808 [19722] dbg: message: Converting...
Фев 18 12:26:53.809 [19722] dbg: message: decoding other encoding type (binary), ignoring
Фев 18 12:26:53.809 [19722] dbg: message: Converting...
Фев 18 12:26:53.874 [19722] dbg: textcat: classifying, skipping: pt ne hi tr sco es da zh.gb2312 sw no lv fr ro vi sa ta id th sr et tl cy ko fi lt hr de be cs yi af bs is qu sl la ga hy ms am ja eu ka mr bg sv zh.big5 rm it he zh hu sq pl eo ca fa fy nl ar gd sk el
Фев 18 12:26:53.923 [19722] dbg: textcat: language possibly: ru.iso-8859-5,uk.koi8-r,ru.koi8-r,ru.windows-1251,en
Фев 18 12:26:53.923 [19722] dbg: textcat: X-Languages: "ru.iso-8859-5 uk.koi8-r ru.koi8-r ru.windows-1251 en", X-Languages-Length: 10000
Фев 18 12:26:53.923 [19722] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x95d1210) implements 'parsed_metadata', priority 0
...
Фев 18 12:26:57.830 [19722] dbg: rules: ran body rule LOCAL_TEST ======> got hit: "test"
...
Code:
store tmp # spamassassin -D < test_cp1251.msg
...
Фев 18 12:28:38.054 [19791] dbg: plugin: Mail::SpamAssassin::Plugin::TextCat=HASH(0x9769cc0) implements 'extract_metadata', priority 0
Фев 18 12:28:38.054 [19791] dbg: message: ---- MIME PARSER START ----
Фев 18 12:28:38.054 [19791] dbg: message: parsing multipart, got boundary: _----------=_1298017287235347
Фев 18 12:28:38.056 [19791] dbg: message: found part of type text/plain, boundary: _----------=_1298017287235347
Фев 18 12:28:38.056 [19791] dbg: message: added part, type: text/plain
Фев 18 12:28:38.057 [19791] dbg: message: found part of type text/html, boundary: _----------=_1298017287235347
Фев 18 12:28:38.057 [19791] dbg: message: added part, type: text/html
Фев 18 12:28:38.057 [19791] dbg: message: parsing normal part
Фев 18 12:28:38.057 [19791] dbg: message: parsing normal part
Фев 18 12:28:38.057 [19791] dbg: message: ---- MIME PARSER END ----
Фев 18 12:28:38.058 [19791] dbg: message: decoding other encoding type (binary), ignoring
Фев 18 12:28:38.058 [19791] dbg: message: Using labeled charset windows-1251
Фев 18 12:28:38.058 [19791] dbg: message: Converting...
Фев 18 12:28:38.059 [19791] dbg: message: decoding other encoding type (binary), ignoring
Фев 18 12:28:38.059 [19791] dbg: message: Using labeled charset windows-1251
Фев 18 12:28:38.059 [19791] dbg: message: Converting...
Фев 18 12:28:38.123 [19791] dbg: textcat: classifying, skipping: pt ne hi tr sco es da zh.gb2312 sw no lv fr ro vi sa ta id th sr et tl cy ko fi lt hr de be cs yi af bs is qu sl la ga hy ms am ja eu ka mr bg sv zh.big5 rm it he zh hu sq pl eo ca fa fy nl ar gd sk el
Фев 18 12:28:38.172 [19791] dbg: textcat: language possibly: ru.iso-8859-5,uk.koi8-r,ru.koi8-r,ru.windows-1251,en
Фев 18 12:28:38.172 [19791] dbg: textcat: X-Languages: "ru.iso-8859-5 uk.koi8-r ru.koi8-r ru.windows-1251 en", X-Languages-Length: 10000
Фев 18 12:28:38.172 [19791] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x92f4838) implements 'parsed_metadata', priority 0
...
Фев 18 11:31:22.483 [18199] dbg: rules: ran body rule LOCAL_TEST ======> got hit: "test"
...
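The four excerpts above are easier to compare side by side. The sketch below is only a bookkeeping aid: the rule names are copied from the quoted "got hit" lines, and it assumes the second pair of runs was made with normalize_charset enabled (the "Converting..." lines suggest this). The excerpts are elided with "...", so hits outside the quoted lines are not represented.

```python
# Rule hits copied from the quoted debug excerpts (elided parts not included).
# The "normalize_charset 1" label for the second pair of runs is an assumption.
hits = {
    ("normalize_charset 0", "test_utf8.msg"): {"LOCAL_TEST", "LOCAL_UTF", "MY_PORNO2", "LOCAL_ALL"},
    ("normalize_charset 0", "test_cp1251.msg"): {"LOCAL_TEST", "LOCAL_WIN"},
    ("normalize_charset 1", "test_utf8.msg"): {"LOCAL_TEST"},
    ("normalize_charset 1", "test_cp1251.msg"): {"LOCAL_TEST"},
}


def lost_rules(message):
    """Rules that hit without normalization but not with it."""
    before = hits[("normalize_charset 0", message)]
    after = hits[("normalize_charset 1", message)]
    return before - after


for msg in ("test_utf8.msg", "test_cp1251.msg"):
    print(msg, sorted(lost_rules(msg)))
```

In both cases only LOCAL_TEST, the plain ASCII rule, survives in the quoted output.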
The rules being tested:
Code:
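# Literal Cyrillic word, case-insensitive
body MY_PORNO2 /порно/i
score MY_PORNO2 1.0
# Plain ASCII control rule
body LOCAL_TEST /test/
score LOCAL_TEST 1.0
# The same word as windows-1251 bytes
body LOCAL_WIN /\xEF\xEE\xF0\xED\xEE/
score LOCAL_WIN 1.0
# The same word as UTF-8 bytes
body LOCAL_UTF /\xd0\xbf\xd0\xbe\xd1\x80\xd0\xbd\xd0\xbe/
score LOCAL_UTF 1.0
# Literal Cyrillic word, case-sensitive
body LOCAL_ALL /порно/
score LOCAL_ALL 1.0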
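For anyone checking the escapes in LOCAL_WIN and LOCAL_UTF, here is a small sketch that derives them from the word itself. It uses only Python's standard codecs; the helper name `as_escapes` is made up for this illustration.

```python
WORD = "порно"  # the word matched by MY_PORNO2 and LOCAL_ALL

cp1251_bytes = WORD.encode("cp1251")  # single-byte Cyrillic encoding
utf8_bytes = WORD.encode("utf-8")     # two bytes per Cyrillic letter


def as_escapes(data):
    """Render bytes as \\xNN escapes, the way they appear in the rules."""
    return "".join(f"\\x{b:02X}" for b in data)


print(as_escapes(cp1251_bytes))  # \xEF\xEE\xF0\xED\xEE
print(as_escapes(utf8_bytes))    # \xD0\xBF\xD0\xBE\xD1\x80\xD0\xBD\xD0\xBE
```

The first line corresponds to the LOCAL_WIN pattern and the second to LOCAL_UTF (which writes the same bytes in lowercase hex).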
If I understand correctly, once "normalize_charset" is enabled the encodings are no longer detected. Can this be fixed?
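One possible reading of the logs, offered as an assumption rather than a confirmed explanation: the "Using labeled charset windows-1251" and "Converting..." lines suggest that the cp1251 message is converted before the rules and TextCat see it, so both test messages end up carrying the same text. The sketch below models only that conversion step in Python. It does not reproduce how SpamAssassin stores the converted text internally, and it does not explain why LOCAL_UTF, MY_PORNO2 and LOCAL_ALL also stopped hitting.

```python
import re

WORD = "порно"
raw_cp1251 = WORD.encode("cp1251")  # stand-in for the body of test_cp1251.msg
raw_utf8 = WORD.encode("utf-8")     # stand-in for the body of test_utf8.msg

# Model of normalization: decode each body using its labeled charset.
text_from_cp1251 = raw_cp1251.decode("cp1251")
text_from_utf8 = raw_utf8.decode("utf-8")
same_text = text_from_cp1251 == text_from_utf8

# Re-encode the converted text as UTF-8 and try the LOCAL_WIN byte pattern.
normalized = text_from_cp1251.encode("utf-8")
win_pattern = re.compile(rb"\xEF\xEE\xF0\xED\xEE")
hit_raw = win_pattern.search(raw_cp1251) is not None
hit_normalized = win_pattern.search(normalized) is not None

print(same_text, hit_raw, hit_normalized)  # True True False
```

Under this model, identical TextCat output for both messages would be expected after conversion, and a rule written as windows-1251 bytes has nothing left to match.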
Locale:
Code:
store spamassassin # locale
LANG=ru_UA.UTF-8
LC_CTYPE="ru_UA.UTF-8"
LC_NUMERIC="ru_UA.UTF-8"
LC_TIME="ru_UA.UTF-8"
LC_COLLATE="ru_UA.UTF-8"
LC_MONETARY="ru_UA.UTF-8"
LC_MESSAGES="ru_UA.UTF-8"
LC_PAPER="ru_UA.UTF-8"
LC_NAME="ru_UA.UTF-8"
LC_ADDRESS="ru_UA.UTF-8"
LC_TELEPHONE="ru_UA.UTF-8"
LC_MEASUREMENT="ru_UA.UTF-8"
LC_IDENTIFICATION="ru_UA.UTF-8"
LC_ALL=
That's about it...