FIFFS - Fiffs Intelligent Feed Filter System
============================================

FIFFS is a massive data mining, classification, AI, coffee cooking system.
It adapts to your newsfeed reading habits to classify and rate news.

This is a basic installation document. It is not very user-friendly yet
and will get better in the future.

REQUIREMENTS
============

At the moment:

	Perl
	Apache
	MySQL 4.1 (every other version doesn't do UTF-8 right)

Plus, of course, a couple of Perl modules, namely:

	- AI::Categorizer
	- Class::DBI
	- Class::DBI::AbstractSearch
	- Class::Accessor
	- CGI::Application
	- CGI::Application::Plugin::Session
	- DateTime
	- DateTime::Format::ISO8601
	- DateTime::Format::Mail
	- DateTime::Format::MySQL
	- Digest::SHA
	- File::Cache
	- Graph
	- HTML::TagFilter
	- Lingua::Identify
	- Lingua::Stem
	- Lingua::StopWords
	- LWP::Simple
	- LWP::UserAgent::WithCache
	- Module::Pluggable
	- POE
	- POE::Component::Client::HTTP
	- Sparse::Vector
	- SQL::Translator
	- Text::Compare
	- Template
	- Test::utf8
	- Time::Duration
	- XML::OPML
	- XML::OPML::SimpleGen
	- XML::RSS
	- XML::RSS::FromAtom
	- XML::Atom::Syndication
	- XML::LibXML

In general, the plan is to move away from this setup, or to additionally
provide a standalone application based on wxPerl and SQLite.

You can install the needed modules with your local CPAN shell, which you
can invoke by typing "cpan" in a console. If this shortcut is not
installed, you can get the same effect by typing

	perl -MCPAN -eshell

Inside the shell you install a module by typing, for example,
"install XML::RSS".

An even easier way is to call

	sudo perl -MCPAN -ne 'if (/^\t- ([\w:]+)$/) { install "$1"; }' README

which does nothing more than parse this file and install all listed
modules. (It matches the lines that start with a tab followed by "- ".)

SETUP
=====

The lib/FIFFS/Config.pm file might need some changes, for example the
directories and the database settings.

The admin password is a SHA-512 hash of the actual password. To generate
it, you can run

	echo -n "password" | shasum -a 512

if Digest::SHA is installed (it provides the shasum command).

The bin/ directory of the distribution contains a script to create the
tables (the database is specified in FIFFS::Config).

	cd bin
	./gendb.pl

The www/ directory contains all the files that are needed to run the
current Apache/CGI based GUI. You can tell your Apache to use this
directory directly:

	ScriptAlias /fiffs/cgi/ "/home/marcus/scripts/fiffs/www/cgi/"
	<Directory "/home/marcus/scripts/fiffs/www/cgi/">
		AllowOverride None
		Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
		Order allow,deny
		Allow from all
	</Directory>

	Alias /fiffs/ "/home/marcus/scripts/fiffs/www/"
	<Directory "/home/marcus/scripts/fiffs/www/">
		Options Indexes MultiViews
		AllowOverride None
		Order allow,deny
		Allow from all
	</Directory>

The bin/ directory contains some helper scripts and the classification
code. The one script you should add to your cron daemon is
update-feeds.pl. Make sure that it runs correctly by hand first, then do
a "crontab -e" and add:

	20,40,00 * * * * cd $HOME/scripts/fiffs/bin/; ./update-feeds.pl

You might want to run bin/init-raters.pl on a weekly basis, because it
trains the Bayes filter which rates the stories based on your reading
habits.

Another thing you might want to do is to tell your cron.daily (usually
under /etc/cron.daily) to run update.pl on a daily basis. This script
does the clustering, therefore it is essential.

Then you can point your browser to the site (e.g.
http://localhost/fiffs/cgi/admin.pl) and add feeds.

Then you might want to run the bin/init-raters.pl and bin/update.pl
scripts by hand. The first run usually needs some time (depending on the
number of newsfeeds and your processor power), but it shouldn't take
that long.

Afterwards you can start using FIFFS by pointing your browser to
http://localhost/fiffs/cgi/index.pl.

!!! NOTE: You have to run init-raters.pl before you run update-feeds.pl
or update-feeds-poe.pl. Otherwise you will get an error.

TROUBLESHOOTING
===============

There are many factors influencing the behaviour of the system, but in
general there is an easy rule... it will get better with time.
This is due to the following reasons:

	- feeds often don't contain publishing dates, so a newly added feed
	  has only new stories, which leads to more displayed stories, longer
	  clustering times and worse rating results
	- rating is done based on the stories you read, so the system first
	  has to get a reasonable amount of read stories to know what you like
	- feed and cluster ordering is done by a rating system which is based
	  on the above two values (rating and publishing date), so this needs
	  some time to adapt

So give it some time (a week) and then file a bug report :-)

Error solutions:

admin.pl: Use of uninitialized value in join or string at /usr/share/perl/5.8/File/Spec/Unix.pm line 36.

	Check your BASEDIR in FIFFS::Config.

Error executing run mode 'mode_add_feed': Undefined subroutine &Text::Ngram::_process_buffer called at /usr/local/lib/perl/5.8.4/Text/Ngram.pm line 163.

DOCUMENTATION
=============

You're currently reading it :-)

There are two other files: README.bin, which tells you which file in
bin/ does what, and README.files, which tells you where to look for
stuff. Anything else will slowly grow.

classify-tc.pl
==============

Classify has two modes: a speed mode which needs quite a lot of memory,
and a slower mode which doesn't need much memory at all. The --disk
option enables the slower mode (because the data is stored on disk
rather than in memory).

To give some numbers:

Doing 555600 comparisons in speed mode:

	real    39m31.856s
	user    35m53.298s
	sys     0m6.326s

Doing 555600 lookups in speed mode:

	real    5m16.446s
	user    4m16.895s
	sys     0m2.073s

Doing 544807 comparisons in disk mode:

	real    50m7.428s
	user    43m33.143s
	sys     1m45.845s

Doing 544807 lookups in disk mode:

	real    11m42.467s
	user    9m39.531s
	sys     0m54.665s

"Lookups" means that data is taken purely from the cache; "comparisons"
means that texts are actually compared. The normal use case is a
combination of both.
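
To compare the four runs above more easily, the wall-clock ("real")
times can be turned into operations per second. The following is a
minimal sketch using sh and awk; ops_per_second is a hypothetical helper
name that is not part of FIFFS, and it assumes the time is given in the
<minutes>m<seconds>s form shown above.

```shell
#!/bin/sh
# Sketch: convert a "real" time such as 39m31.856s into operations per second.
# ops_per_second is a hypothetical helper name, not part of FIFFS.
ops_per_second() {
    # $1 = number of operations, $2 = wall-clock time as <min>m<sec>s
    awk -v ops="$1" -v t="$2" 'BEGIN {
        split(t, parts, "m")        # parts[1] = minutes, parts[2] = seconds plus "s"
        sub(/s$/, "", parts[2])     # drop the trailing "s"
        secs = parts[1] * 60 + parts[2]
        printf "%.0f\n", ops / secs # rounded to whole operations per second
    }'
}

ops_per_second 555600 39m31.856s   # comparisons, speed mode
ops_per_second 555600 5m16.446s    # lookups, speed mode
ops_per_second 544807 50m7.428s    # comparisons, disk mode
ops_per_second 544807 11m42.467s   # lookups, disk mode
```

With the figures above this comes out to roughly 234 comparisons and
1756 lookups per second in speed mode, versus roughly 181 comparisons
and 776 lookups per second in disk mode.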

LICENCE
=======

This program is free software; you can redistribute it and/or modify it
under the terms of either:

	a) the GNU General Public License as published by the Free Software
	   Foundation; either version 1, or (at your option) any later
	   version, or

	b) the "Artistic License" which comes with Perl.

On Debian GNU/Linux systems, the complete text of the GNU General Public
License can be found in `/usr/share/common-licenses/GPL' and the
Artistic Licence in `/usr/share/common-licenses/Artistic'.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Have fun,
	Marcus

For any questions, inquiries, bug reports or money donations, feel free
to contact me at marcus@thiesen.org