FIFFS - Fiffs Intelligent Feed Filter System
============================================
FIFFS is a massive data mining, classification, AI, coffee cooking system.
It adapts to your newsfeed reading habits to classify and rate news.

This is a basic installation document. It is not very user-friendly yet
and will get better in the future.
REQUIREMENTS
============
At the moment:
Perl
Apache
MySQL 4.1 (other versions don't handle UTF-8 correctly)
Plus, of course, a couple of Perl modules, namely:
- AI::Categorizer
- Class::DBI
- Class::DBI::AbstractSearch
- Class::Accessor
- CGI::Application
- CGI::Application::Plugin::Session
- DateTime
- DateTime::Format::ISO8601
- DateTime::Format::Mail
- DateTime::Format::MySQL
- Digest::SHA
- File::Cache
- Graph
- HTML::TagFilter
- Lingua::Identify
- Lingua::Stem
- Lingua::StopWords
- LWP::Simple
- LWP::UserAgent::WithCache
- Module::Pluggable
- POE
- POE::Component::Client::HTTP
- Sparse::Vector
- SQL::Translator
- Text::Compare
- Template
- Test::utf8
- Time::Duration
- XML::OPML
- XML::OPML::SimpleGen
- XML::RSS
- XML::RSS::FromAtom
- XML::Atom::Syndication
- XML::LibXML
In the long run, the plan is to move away from this setup, or to
additionally provide a standalone application based on wxPerl and SQLite.
You can install the needed modules with your local CPAN shell, which
you invoke by typing "cpan" in a console. (If this shortcut is not
installed, "perl -MCPAN -e shell" has the same effect.) To install a
module, type for example "install XML::RSS" at the CPAN prompt.

An even easier way is to call

    sudo perl -MCPAN -ne 'if (/^\s*- ([\w:]+)\s*$/) { install "$1"; }' README

which does nothing more than parse this file and install all listed
modules.
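Before installing anything as root, it can help to see which of the
listed modules are still missing. The following is a minimal sketch and
not part of the FIFFS distribution: the function names list_modules and
missing_modules are made up for this example, and it assumes the module
list keeps the "- Module::Name" format used above.

```shell
#!/bin/sh
# Print every module name listed as "- Module::Name" in the given file.
# Lines with further words after the name (such as the prose bullets in
# TROUBLESHOOTING) do not match and are skipped.
list_modules() {
    sed -n 's/^[[:space:]]*-[[:space:]]\{1,\}\([A-Za-z0-9_:]\{1,\}\)[[:space:]]*$/\1/p' "$1"
}

# Print only those listed modules that the local perl cannot load.
missing_modules() {
    list_modules "$1" | while read -r module; do
        perl -M"$module" -e 1 </dev/null >/dev/null 2>&1 || echo "$module"
    done
}

# Usage (run in the distribution directory):
#   missing_modules README
```

An empty result means every listed module could be loaded.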
SETUP
=====
The lib/FIFFS/Config.pm file will probably need some changes, for
example the directories and the database settings.

The admin password is a SHA-512 hash (a one-way hash, not encryption).
To generate it, run

    echo -n "password" | shasum -a 512

The shasum command is installed together with Digest::SHA.
The bin/ directory of the distribution contains a script to
create the tables (the database is specified in FIFFS::Config):

    cd bin
    ./gendb.pl
The www/ directory contains all the files that are needed to run the
current Apache/CGI based GUI. You can point your Apache directly at
this directory (adjust the paths to your installation):

    ScriptAlias /fiffs/cgi/ "/home/marcus/scripts/fiffs/www/cgi/"
    <Directory "/home/marcus/scripts/fiffs/www/cgi/">
        AllowOverride None
        Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
        Order allow,deny
        Allow from all
    </Directory>

    Alias /fiffs/ "/home/marcus/scripts/fiffs/www/"
    <Directory "/home/marcus/scripts/fiffs/www/">
        Options Indexes MultiViews
        AllowOverride None
        Order allow,deny
        Allow from all
    </Directory>
The bin/ directory contains some helper scripts and the classification
code. The one script you should add to your crontab is update-feeds.pl.
Make sure that it runs correctly when started by hand, then do a
"crontab -e" and add:

    0,20,40 * * * * cd $HOME/scripts/fiffs/bin/ && ./update-feeds.pl
You might want to run bin/init-raters.pl on a weekly basis because it
trains the Bayes filter which rates the stories based on your reading
habits.
You should also tell your cron.daily (usually under /etc/cron.daily)
to run update.pl once a day. This script does the clustering,
therefore it is essential.
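The three recurring jobs described above can also live in a single user
crontab instead of /etc/cron.daily. The sketch below only prints
candidate entries, so you can review them before pasting them into
"crontab -e". The function name fiffs_cron_lines and the chosen times
(03:30 daily, 04:30 on Sundays) are assumptions; only the 20-minute
feed update comes from the example above.

```shell
#!/bin/sh
# Print crontab entries for the FIFFS bin/ directory given as argument.
fiffs_cron_lines() {
    bindir=$1
    # Fetch feeds every 20 minutes (schedule taken from this document).
    echo "0,20,40 * * * * cd $bindir && ./update-feeds.pl"
    # Cluster once a day; 03:30 is an arbitrary choice.
    echo "30 3 * * * cd $bindir && ./update.pl"
    # Retrain the Bayes filter weekly; Sunday 04:30 is an arbitrary choice.
    echo "30 4 * * 0 cd $bindir && ./init-raters.pl"
}

# Usage:
#   fiffs_cron_lines "$HOME/scripts/fiffs/bin"
```

Using "&&" rather than ";" keeps a script from starting in the wrong
directory if the cd fails.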
Then you can point your browser to the site (e.g.
http://localhost/fiffs/cgi/admin.pl) and add feeds. Then you might
want to run the bin/init-raters.pl and bin/update.pl scripts by hand.
The first run takes a while (depending on the number of newsfeeds and
your processor power), but it shouldn't be that long.
Afterwards you can start using FIFFS by pointing your browser to
http://localhost/fiffs/cgi/index.pl.
!!! NOTE:
You have to run init-raters.pl before you run update-feeds.pl or
update-feeds-poe.pl. Otherwise you will get an error.
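The ordering constraint in this note can be encoded in a small wrapper
that stops at the first failing step. This is a sketch under
assumptions: run_in_order is a made-up helper, and placing update.pl
last is a guess based on clustering needing fetched stories. The note
itself only fixes init-raters.pl before update-feeds.pl.

```shell
#!/bin/sh
# Run the given scripts from one directory, in order, stopping at the
# first one that fails. Returns that script's exit status.
run_in_order() {
    dir=$1
    shift
    for script in "$@"; do
        echo "running $script"
        ( cd "$dir" && "./$script" ) || return $?
    done
}

# Usage, after the feeds have been added through admin.pl:
#   run_in_order "$HOME/scripts/fiffs/bin" init-raters.pl update-feeds.pl update.pl
```

Each script runs in a subshell, so the caller's working directory is
left unchanged.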
TROUBLESHOOTING
===============
There are many factors influencing the behaviour of the system, but
in general there is an easy rule: it will get better with time.
This is due to the following reasons:

- Feeds often don't contain publishing dates, so every story in a
  newly added feed counts as new. This leads to more displayed
  stories, longer clustering times and worse rating results.
- Rating is based on the stories you read, so the system first has
  to collect a reasonable number of read stories to know what you
  like.
- Feed and cluster ordering is based on the two values above (rating
  and publishing date), so it needs some time to adapt as well.

So give it some time (a week) and then file a bug report :-)
Error solutions:

    admin.pl: Use of uninitialized value in join or string at
    /usr/share/perl/5.8/File/Spec/Unix.pm line 36.

Check your BASEDIR in FIFFS::Config.

    Error executing run mode 'mode_add_feed': Undefined subroutine
    &Text::Ngram::_process_buffer called at
    /usr/local/lib/perl/5.8.4/Text/Ngram.pm line 163.
DOCUMENTATION
=============
You're currently reading it :-)
There are two other files: README.bin, which tells you which file in
bin/ does what, and README.files, which tells you where to look for
stuff. Everything else will grow slowly.
classify-tc.pl
==============
classify-tc.pl has two modes: a speed mode, which needs quite a lot of
memory, and a slower mode, which doesn't need much memory at all.
The --disk option enables the slower mode (because the data is stored
on disk rather than in memory).

To give some numbers:

Doing 555600 comparisons in speed mode:
    real 39m31.856s
    user 35m53.298s
    sys  0m6.326s

Doing 555600 lookups in speed mode:
    real 5m16.446s
    user 4m16.895s
    sys  0m2.073s

Doing 544807 comparisons in disk mode:
    real 50m7.428s
    user 43m33.143s
    sys  1m45.845s

Doing 544807 lookups in disk mode:
    real 11m42.467s
    user 9m39.531s
    sys  0m54.665s

Lookups means that data is taken purely from the cache; comparisons
means that texts are actually compared. The normal use case is a
combination of both.
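Because the speed-mode and disk-mode runs processed different numbers
of items, the raw times are easier to compare as throughput. The sketch
below converts the "real" times into items per second with awk. The
helper name per_second is made up for this example, and the results are
rounded down to whole numbers.

```shell
#!/bin/sh
# per_second COUNT MINUTES SECONDS: items per second, rounded down.
per_second() {
    awk -v n="$1" -v m="$2" -v s="$3" 'BEGIN { printf "%d\n", n / (m * 60 + s) }'
}

per_second 555600 39 31.856   # comparisons, speed mode
per_second 555600 5 16.446    # lookups, speed mode
per_second 544807 50 7.428    # comparisons, disk mode
per_second 544807 11 42.467   # lookups, disk mode
```

On these figures, speed mode handles roughly 234 comparisons and 1755
lookups per second, against roughly 181 and 775 in disk mode.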
LICENCE
=======
This program is free software; you can redistribute it and/or modify
it under the terms of either:
a) the GNU General Public License as published by the Free Software
Foundation; either version 1, or (at your option) any later
version, or
b) the "Artistic License" which comes with Perl.
On Debian GNU/Linux systems, the complete text of the GNU General
Public License can be found in `/usr/share/common-licenses/GPL' and
the Artistic License in `/usr/share/common-licenses/Artistic'.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Have fun,
Marcus
For any questions, inquiries, bug reports or money donations, feel free
to contact me at marcus@thiesen.org