[Reader-list] Report for May

Fri Jun 3 11:37:54 IST 2005

        NewsRack: Automating News Gathering and Classification
        ------------------------------------------------------

Here is my report for the month of May.

Bug fixes
---------
I have made several minor bug fixes and the software is now more stable.
The Sarai installation has now been running for over 3 weeks without
crashing or needing a restart.

New developers 
--------------
A couple of students (undergraduate, masters) are now considering working on
NewsRack for their projects under the joint guidance of Prof. Om Damani of
IIT-Bombay.  I have been working on defining suitable projects for them,
based on their background, skills, and time commitment.  One of these
students will work for about 1.5 months, and the masters student will work
for about a year.

Current development
------------------- 
I have begun the process of making NewsRack usable with news sites that
do not provide RSS feeds (Hindu, Deccan Herald, Hindustan Times, for example).
The algorithm is straightforward: it is essentially a spidering process,
outlined below.
  1. Download main URL
  2. Identify Base HREF, if any
  3. Identify all unique relative URLs in the page -- but weed out
     links that point to images.
  4. Construct a list of URLs to follow using the Base HREF and the relative
     URLs found in the page.
  5. Recursively follow the links found in 4 above by repeating steps 2-4
     for every candidate link.
By this process, the title, date, and URL can be identified for every
possible news item.  There are (of course) problems with this approach,
which I describe below.
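
As a rough illustration, the steps above could be sketched as follows.  This
is Python rather than the Perl script mentioned in this report, and it is
only a sketch under stated assumptions: the names LinkCollector,
candidate_links, crawl, the image suffix list, and the max_pages limit are
all hypothetical, and fetch is assumed to be any function that returns the
HTML text for a URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed list of image extensions to weed out (step 3).
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".bmp")


class LinkCollector(HTMLParser):
    """Collects the Base HREF (if any) and all anchor hrefs in a page."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href") and self.base_href is None:
            self.base_href = attrs["href"]
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def candidate_links(html, page_url):
    """Steps 2-4: unique, relative, non-image links, resolved to full URLs."""
    collector = LinkCollector()
    collector.feed(html)
    # Step 2: use the Base HREF if the page declares one.
    if collector.base_href:
        base = urljoin(page_url, collector.base_href)
    else:
        base = page_url
    seen, result = set(), []
    for href in collector.hrefs:
        parsed = urlparse(href)
        if parsed.scheme or parsed.netloc or href.startswith("#"):
            continue  # not a relative link to another page
        if parsed.path.lower().endswith(IMAGE_SUFFIXES):
            continue  # step 3: weed out links that point to images
        absolute = urljoin(base, href)  # step 4
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def crawl(start_url, fetch, max_pages=50):
    """Steps 1 and 5: follow candidate links, visiting each URL once."""
    visited, queue = [], [start_url]
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.append(url)
        for link in candidate_links(fetch(url), url):
            if link not in visited and link not in queue:
                queue.append(link)
    return visited
```

The sketch uses a queue instead of literal recursion so that a page cap
(max_pages) can stop runaway crawls; that cap is my assumption, not part of
the algorithm as listed.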

I have begun this experiment by writing Perl scripts to fine-tune the
algorithm.  I now have a working Perl script that successfully downloads
the day's news links for Hindu, Deccan Herald, and The Telegraph (of those
newspapers that I have tried).  However, it doesn't work well for Hindustan
Times, Indian Express, or Times of India.  On further investigation, it
turns out that Hindu, Deccan Herald, and Telegraph organize their news as
http://<ROOT>/<DATE>/<STORY-LINK> (or some such variation).  Since only
relative links are followed, only links relative to "http://<ROOT>/<DATE>"
are followed, and hence only news for that day is downloaded.  ToI, HT, and
IE, on the other hand, all organize their websites as database-driven sites,
so their story links are apparently not grouped under a dated path and this
strategy does not quite work.  I am now investigating other ideas, including
examining time stamps of the published news items and only downloading news
published in the last 24 hours or so.
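
To make the date-path effect concrete, here is a small Python sketch (again
not the Perl script itself).  It makes the scoping explicit by keeping only
links that resolve under the directory of the page they were found on; the
function name links_under_prefix and the example.org URLs are hypothetical.

```python
from urllib.parse import urljoin


def links_under_prefix(page_url, hrefs):
    """Resolve relative hrefs and keep those under the page's own directory."""
    # urljoin(page_url, ".") yields the directory of the page, for example
    # "http://<ROOT>/<DATE>/" for a page at "http://<ROOT>/<DATE>/index.htm".
    prefix = urljoin(page_url, ".")
    resolved = (urljoin(page_url, href) for href in hrefs)
    return [url for url in resolved if url.startswith(prefix)]


# Date-organized site: only the current day's story survives.
dated = links_under_prefix(
    "http://example.org/2005/06/03/index.htm",
    ["stories/a.htm", "../02/stories/old.htm", "/archive/x.htm"],
)

# Database-driven site: every link sits under the same root, so the
# prefix check cannot separate today's stories from older ones.
db_driven = links_under_prefix(
    "http://example.org/index.asp",
    ["article.asp?id=123", "article.asp?id=7"],
)
```

In the second call both links are kept, which is the failure mode described
above for database-driven sites.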

In addition, there is still the problem of sifting through downloaded files
to identify which files are actually news stories.  For example, some of the
downloaded files are page indexes and headline pages.  It is unclear how to
do this without some sort of newspaper-specific hacks.  For example, all
Hindu stories are stored as http://<ROOT>/<DATE>/stories/<LINK>, so I could
discard all other links.  But there is no obvious generic solution to this
problem at this time.
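
A newspaper-specific filter of the kind described above might look like the
following Python sketch.  Only the Hindu layout comes from this report; the
table name STORY_PATTERNS, the function story_links, the exact regular
expression, and the fallback of keeping everything for unknown papers are
assumptions for illustration.

```python
import re

# Hypothetical per-newspaper URL patterns.  Only the Hindu entry is based
# on the layout http://<ROOT>/<DATE>/stories/<LINK> noted in the report.
STORY_PATTERNS = {
    "hindu": re.compile(r"^https?://.+/stories/[^/]+$"),
}


def story_links(paper, urls):
    """Keep only URLs that match the paper's story pattern, if one is known."""
    pattern = STORY_PATTERNS.get(paper)
    if pattern is None:
        return list(urls)  # no known pattern: nothing can be discarded
    return [url for url in urls if pattern.match(url)]
```

Each new newspaper would need its own entry in the table, which is exactly
why this does not amount to a generic solution.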

New Interest
------------
There is continuing interest in this tool.  When possible, I am now trying
to meet with individuals and groups and help them with the process of
setting up a profile, since the tool still requires a fair bit of initial
time investment.