[Reader-list] NewsRack update (fwd)

Mon May 2 21:43:25 IST 2005

Resending to reader-list without the attachment since the earlier posting is
waiting in the approval queue.  -Subbu.

---------- Forwarded message ----------
Date: Fri, 29 Apr 2005 14:01:33 +0530 (IST)
From: Subramanya Sastry <sastry at cs.wisc.edu>
To: prc at sarai.net, reader-list at sarai.net
Subject: NewsRack update

Hello everyone,

Beginning April 2005, I have received an extension of the FLOSS Independent
Fellowship from Sarai to continue working on NewsRack.  I am sending along 2
documents with this email.
(a) the revised proposal I submitted to Sarai in February (enclosed as plain
    text after this foreword).  I have also added an updates section below
    to reflect changes since then.
(b) a work-in-progress design document for NewsRack (PDF attachment)

Update since Feb 2005
---------------------
-> I have tested NewsRack on both Tomcat-5.0.28 and Resin-2.1.16.
-> News is automatically updated every 4 hours -- it is no longer necessary
   to log in and click on the 'Download news' link.
-> RSS2.0 feeds are now available for every issue, and for every category
   in every issue.
-> There is also "New since last download" information, which is useful
   when you visit the site regularly (like me) to see what fresh news items
   have been classified in which issues and categories.
-> Have been working on a design document for further modularising NewsRack
   and enabling independent development along multiple tracks.
-> Updated some of the help files and added an About NewsRack and FAQ on
   the site.

The only publicly deployed version of NewsRack is at:
    http://floss.sarai.net/newsrack

NewsRack has undergone various bug fixes and enhancements (some of which are
described above) in the last few months.  I am thinking of doing an initial
release of this soon on SourceForge, where the project is registered.
However, I am also hoping to enlist the participation of new developers in
this project.  That is also why I am sending along the 2-page design
document, which captures my current thinking on how NewsRack should be
structured.

If any of you is interested in working on this with me, please get in
touch with me at 'sastry at cs.wisc.edu'.  If you are in Bangalore, we could
meet in person to take it forward.

Subbu.
---------------------------------------------------------------------------
         NewsRack: Automating News Gathering and Classification
         ------------------------------------------------------

Abstract:
---------
Several organizations in the social development sector monitor news that
is relevant to their work.  This is a time-consuming and laborious process
for some groups, especially when the news is monitored, marked, cut, and
filed using hard copies of newspapers and magazines.  Prior experience
with the press clippings page on www.narmada.org indicates that some of
this work can be automated.  This simplifies the task of news monitoring
and also saves time.

This project attempts to automate news monitoring, and aims to provide
tools for classifying, filing, and long-term archiving of news.  The
project will deliver a tool, called NewsRack, that can be installed, and
will also provide all the same services on a website for those who do not
want to (or cannot) install it.

A basic system is in place at http://floss.sarai.net/newsrack and several
people and organizations (India Together; CED Bangalore; Bank Information
Center Washington-DC; Jagori Delhi) have expressed interest in using it.
However, the tool is still not fully user-friendly, it is missing several
essential features, and work needs to be put in to develop a good user
manual and help system.  I am proposing to continue development of NewsRack
along these lines, and am also hoping to enlist the help of more developers
to assist me.

Overview
--------
This proposal is organized into the following sections:
1. Project Background
2. User base for NewsRack
3. Current capabilities of NewsRack
4. Brief technical details
5. Ongoing challenges
6. Further development
7. Timeline
8. Deliverables

---------------------
1. Project Background
---------------------
I gained close experience with the process of news gathering when I
maintained the www.narmada.org website for over 2 years.  I still have a
marginal involvement with maintaining the website.  Updating the press
clippings section on this site was initially one of the most laborious
tasks.  Later, by using 'wget' scripts to download the entire content of
newspapers and 'grep'-ping the content for certain keywords, a lot of this
task was simplified, and the manual work came down to about half an hour
a day.  Yet, the entire process has been less than satisfactory as can
be seen by the current breakdown in the process of updating the press
clippings section, several broken links to articles, and lack of any form
of topic-wise/sector-wise classification of articles.

When I spent some time at Environment Support Group, Bangalore, I noticed
that a lot of time used to be spent in this process of collecting news
items, marking, cutting, and filing them.  Besides, there was a perennial
problem of backlog, sometimes resulting in a pile of newspapers that had
to be gone through.

When I visited CED, Bangalore, I found out that they also had a process
for selecting news clippings and filing them.  While they had an electronic
archive, news was added to it by downloading articles (previously marked in
the physical version of the newspapers) from the web and extracting the
text content.  There was no automated process here.

Most recently, when I was in Delhi and visited Sarai, I noticed on the
whiteboard something to the effect of: "everyone should spend at least
3 hours every week scanning relevant news clippings and adding it to
the database".  This re-affirmed to me that this is a time-consuming
process, and that not everyone wants to do the job, requiring some amount
of social coercion to stay on top of the process.

When I visited CSE, I found out that over 80 news publications are
monitored there every day, over 500 news clippings are processed, and that
they have about 5-8 full-time employees just for this purpose!  I was
told that "news collection is the pain point of many organizations".
The scale of operations here was quite fascinating.  Every day, the news
that was identified was abstracted to generate a daily news digest
that was circulated within CSE.  Furthermore, the classified news was used
to generate various monthly digests, called the Green Files, which were
collections of the most relevant and important environment-related news.
The process of selecting news was done on the basis of a keyword thesaurus
that had about 4800 keywords!  This thesaurus has been developed over a
period of almost 20 years.  The thesaurus is the knowledge-base of CSE's
library operations that aids them in selecting news and classifying them
into the various files based on the particular issue that the article
addressed.  The web-people at CSE were curious to see how the tool develops
and felt that it might be useful -- though they were understandably cautious
to see/judge how well the news classification could be automated and how
much work could be saved.

-------------------------
2. User base for NewsRack
-------------------------
From April 2004 till September 2004, with the help of Sarai's independent
fellowship, I developed a preliminary version of the tool, and deployed
it at http://floss.sarai.net/newsrack.  Since then, I have received several
expressions of interest whenever I have demo-ed the tool.  In addition,
based on my postings to PRC-list and reader-list, I have had several people
express interest in NewsRack.

India Together (http://www.indiatogether.org), Himanshu Thakkar from
SANDRP Delhi, Manish Bapna from Bank Information Center at Washington DC,
Jagori from New Delhi, and CED Bangalore have all seen various versions of
this tool and have expressed interest in using it.

Other friends in informal discussions have also expressed interest in
this tool.  My gut feeling, based on talking with several people, is that
there is definite value, interest, and curiosity about NewsRack.  Its
success and further deployment will depend on:
- how easy it is to develop the knowledge base,
- how easy it is to develop the filtering rules for news gathering,
- and how well the classification performs.

In the last 4 months, I have deployed the tool on Sarai's server, and have
passed the word around to a few people.  In the aftermath of the tsunami
disaster, I also developed a profile for monitoring news related to the
tsunami issue, and to classify news into various categories (like
relief-efforts, news-by-region, death-toll, etc.)

Given existing interest and feedback I have received, I am confident that
NewsRack will be a useful tool to monitor news, among other applications,
once the process of creating profiles is made more friendly.  In this
endeavour, Sarai's support will greatly help me continue development of
this tool.

In addition, I have received suggestions for other potential uses of this
tool.  One potential use is for monitoring media coverage.  NewsRack can
help in the process of monitoring media sources for their news coverage:
what kind of coverage, and how much coverage, a particular issue receives
in a particular newspaper, for example.

-----------------------------------
3. Current capabilities of NewsRack
-----------------------------------
As it stands today, NewsRack can be deployed as a web-based service with
multiple users using the service or as a standalone tool on one's desktop.
This distinction is somewhat cosmetic, because both uses require a web
server over which NewsRack runs.  In the web-based service incarnation,
the installation will be on a web-server that is accessible on the internet.
However, in a standalone-tool incarnation, the web-server serves pages
locally.

3.1 Collaborative development of filtering rules
------------------------------------------------
To use NewsRack, a user has to register with the system.  Once registered,
the user has to create a profile for downloading news and classifying it.
This profile (via filtering rules) tells NewsRack (a) what news sources to
download from (b) what news clippings to select (c) and how to classify the
selected news items.

The filtering rules are written using a 2-step process of
   1. specifying keywords and associating them with concepts, and
   2. using concepts to compose rules.
By using concepts (as opposed to keywords) in filtering rules, concept
definitions can evolve over time (new keywords added, useless keywords
removed, etc.) *without* having to modify the filtering rules themselves.
This also keeps the filtering rules simple and easy to understand.  For
example, a filtering rule for dam-rehabilitation category could be:
              [dam] AND [rehabilitation]
where [dam] and [rehabilitation] are concepts.  If tomorrow, governments
have a new rehabilitation policy which is referred to in the newspapers
as NRPD (National Rehabilitation Policy for Dams), one could add the
new keyword "NRPD" to the [rehabilitation] concept without modifying
the filtering rule for the dam-rehabilitation category.  The changes
take effect at all places where this concept is referenced.  This ability
simplifies the maintenance/evolution of the filtering rules over time.
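
The indirection between keywords and concepts can be sketched as follows.
This is only an illustration, written in Python for brevity (NewsRack itself
is written in Java).  The names `concepts`, `concepts_in`, and
`matches_dam_rehabilitation` are hypothetical, and the naive substring
matching is an assumption made to keep the example short.

```python
# Hypothetical illustration: concepts map to keyword sets, and the rule
# refers only to concept names, never to keywords directly.
concepts = {
    "dam": {"dam", "reservoir", "mega-dam"},
    "rehabilitation": {"rehabilitation"},
}


def concepts_in(text, concepts):
    """Return the names of all concepts with at least one keyword in text."""
    lowered = text.lower()
    return {name for name, keywords in concepts.items()
            if any(keyword in lowered for keyword in keywords)}


def matches_dam_rehabilitation(text, concepts):
    """The rule [dam] AND [rehabilitation], expressed over concepts."""
    found = concepts_in(text, concepts)
    return "dam" in found and "rehabilitation" in found


# A made-up sentence, used only to exercise the rule.
article = "Critics say the NRPD ignores families displaced by the reservoir."

before = matches_dam_rehabilitation(article, concepts)  # "nrpd" not yet known
concepts["rehabilitation"].add("nrpd")                  # extend the concept only
after = matches_dam_rehabilitation(article, concepts)   # rule itself unchanged
```

The article starts matching only after the keyword is added, and the rule
function is never edited.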

The second interesting feature about NewsRack is that all concept definitions
and filtering rule definitions can be shared and extended.  Thus, a pool of
concepts can be collaboratively developed.  For example, if one user has
defined the concepts [World Bank], [India], [dams], [privatisation],
another user who wants to monitor news about world bank projects in india
can use these concepts without having to redefine them, and if necessary
extend those concepts to suit her needs.

Thus, NewsRack allows for knowledge sharing across users and once a critical
knowledge base is in place, any new user should be able to develop his/her
profile very quickly.

Please see the NewsRack documentation, which describes in more detail how
to specify profiles, concepts, and categories.

3.2 Ease of writing filtering rules
-----------------------------------
Experience and feedback have shown me that while the rule specification
language itself is pretty straightforward, the process is nevertheless
intimidating for someone not familiar with XML.  Based on feedback I have
received so far, I have realized that until this process becomes
more friendly for non-technical users, NewsRack cannot substantially
widen its user base.

3.3 RSS vs non-RSS news sources
-------------------------------
RSS (Rich Site Summary OR RDF Site Summary, depending on the version) is
being widely used to push content from websites to users in contrast to
the earlier model where users visited websites.  With RSS feeds, a user
can subscribe to several news feeds, install an RSS news reader on her
computer, and see updates from several websites in a single
window without having to visit several different websites.  So, whereas
users visited websites (for website updates) via browsers which understand
HTML, website updates now come to users via RSS readers which understand RSS.
RSS is most pertinent to sites that have frequent updates (like newspapers).

Right now, NewsRack only supports news sources with RSS feeds.  This was
the easiest route for developing and testing other features of the system.
Once the current system is deployed, I will work on supporting news sources
that do not provide RSS feeds.  The primary technical challenge here is to
download all news clippings published that day, and to extract date, title,
author information for each clipping.  RSS feeds provide all this information
in a very easy format.
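
To show how directly an RSS 2.0 feed exposes these fields, here is a small
sketch in Python (not NewsRack's Java code).  The feed content, the
`read_items` function, and the example.org URLs are made up for illustration;
only the element names (item, title, link, author, pubDate) come from the
RSS 2.0 format.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 fragment; the contents are purely illustrative.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Daily</title>
    <item>
      <title>Dam height raised</title>
      <link>http://example.org/news/1</link>
      <author>reporter@example.org (A. Reporter)</author>
      <pubDate>Fri, 29 Apr 2005 08:30:00 +0530</pubDate>
      <description>A short abstract of the article.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return one dict per <item>, using fields that RSS 2.0 defines."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "author": item.findtext("author"),   # None if the feed omits it
            "date": item.findtext("pubDate"),
        })
    return items


items = read_items(SAMPLE_FEED)
```

A non-RSS source offers no such fixed element names, which is why the same
information has to be recovered from the page layout instead.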

At this time, Indian Express, Rediff, and Times of India provide RSS feeds.
However, Indian Express's feeds only cover the front page items, and not
all published news.  In the future, all newspapers will likely support
RSS feeds.  In the last 6 months, Times of India has started providing
news feeds whereas they were not available when I began this project last
year.

3.4 Archiving of news clippings
-------------------------------
All selected news clippings are also archived locally.  This archiving is
done by extracting the text-based content of the clipping and stripping
away everything else.  At this time, while this filtering process works
well, the output can be subject to further "beautification" to remove the
still-remaining extraneous text not directly associated with the news
content.
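
A minimal sketch of this kind of text extraction, written in Python with the
standard library's HTML parser.  It is an assumption about the general
approach, not the extraction code NewsRack uses; `TextExtractor` and
`extract_text` are hypothetical names and the page is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = ("<html><head><style>p {color: red}</style></head><body>"
        "<h1>Dam height raised</h1><script>track();</script>"
        "<p>The authority approved the plan.</p></body></html>")
text = extract_text(page)
```

A stripper like this keeps every visible text node, so navigation links and
footers survive; that is the kind of extraneous text the "beautification"
step would still have to remove.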

--------------------------
4. Brief technical details
--------------------------
NewsRack has been developed in Java.  The web-service capability has been
developed on top of Servlets technology using the Struts web development
framework (part of Apache's Jakarta project).

4.1 Front-end
-------------
NewsRack can be installed on any web server that supports Java Servlets.
The current version has been tested on the Resin 2.1.12 web server.  But,
the system should also run on Tomcat -- I will test this soon to verify.

The application has been developed using the Model-View-Controller (MVC)
design pattern.  Thus, there are 3 separate components to NewsRack
1. the application model with definitions of Users, Concepts, Categories,
   Profiles, News Items, etc.
2. the view that provides data presentation and user input, and
3. a controller to dispatch requests and control flow.

I have used the Struts framework of the Apache Jakarta project to implement
this MVC pattern.  Using Struts has simplified the development of the user
interface (the V of MVC).  I have used the Velocity templating engine to
develop the various output screens of NewsRack.

4.2 Back-end
------------
In the current implementation of NewsRack, all backend archiving of news,
user profiles (including concept definitions and filtering rules) is done
using XML files over the file system provided by the OS.  Thus, the back end
can be seen as a simple XML database.  The database layer is implemented
as an abstract interface so that in future, other database implementations
like MySQL can be used (which might become necessary as the system evolves).
Ideally, these new backends can be implemented without serious changes to
the other components of NewsRack.
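
The idea of hiding the storage behind an abstract interface can be sketched
as below.  This is a Python illustration only; the real interface is in Java
and its methods are not described here, so `NewsStore`, `XmlFileStore`,
`save_item`, `titles`, and the one-file-per-category layout are all
hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class NewsStore(ABC):
    """What the rest of the application is allowed to depend on."""

    @abstractmethod
    def save_item(self, category, title, url):
        ...

    @abstractmethod
    def titles(self, category):
        ...


class XmlFileStore(NewsStore):
    """Hypothetical backend: one XML file per category on the file system."""

    def __init__(self, root_dir):
        self.root_dir = root_dir

    def _path(self, category):
        return os.path.join(self.root_dir, category + ".xml")

    def save_item(self, category, title, url):
        path = self._path(category)
        if os.path.exists(path):
            tree = ET.parse(path)
            root = tree.getroot()
        else:
            root = ET.Element("archive")
            tree = ET.ElementTree(root)
        item = ET.SubElement(root, "item", url=url)
        item.text = title
        tree.write(path, encoding="utf-8")

    def titles(self, category):
        path = self._path(category)
        if not os.path.exists(path):
            return []
        return [item.text for item in ET.parse(path).getroot().iter("item")]


with tempfile.TemporaryDirectory() as tmp:
    store = XmlFileStore(tmp)  # callers only rely on the NewsStore methods
    store.save_item("narmada-dam", "Dam height raised", "http://example.org/1")
    store.save_item("narmada-dam", "Second made-up item", "http://example.org/2")
    saved = store.titles("narmada-dam")
```

A database-backed class implementing the same two methods could then replace
`XmlFileStore` without touching the calling code.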

4.3 Specification format for the news filter
--------------------------------------------
The format for specifying concept definitions, filtering rules, news sources,
and user profiles is based on XML.  Initial feedback from people who had
not heard of XML or those who are non-techies has been that they can easily
write these XML-based profile specifications.  At this time, I have not
provided any GUI for developing these XML specifications.  One could use
any simple text editor (like Notepad on Windows or vi/vim/emacs on Unix)
or other XML editors to develop these specifications.

As discussed in Section 3.1, the filtering rules are written using a 2-step
process.  An example here will clarify this process.

      Concept definitions
      -------------------
         <concept>
            <name> dam </name>
            <keyword> dam </keyword>
            <keyword> reservoir </keyword>
            <keyword> mega-dam </keyword>
         </concept>

         <concept>
            <name> narmada </name>
            <keyword> narmada </keyword>
         </concept>

         <concept>
            <name> ssp </name>
            <keyword> sardar sarovar </keyword>
            <keyword> ssp </keyword>
            <keyword> sardar sarovar narmada nigam limited </keyword>
            <keyword> ssnnl </keyword>
         </concept>

      Category definitions
      --------------------
         <category>
            <name> narmada dam </name>
            <rule> narmada AND dam </rule>
         </category>

         <category>
            <name> sardar-sarovar-dam </name>
            <rule> ssp </rule>
         </category>

Thus, the process of modifying concepts is a matter of adding/deleting
or changing existing keywords for that particular concept.  Filtering
rules are simple boolean expressions composed using AND/OR keywords.
Negation support is still sketchy because the semantics of negation
are not clear in this context.  What does it mean to say (NOT dam),
for example?  In addition, context-based qualification is also supported.
For example, "maheshwar" could be the name of a person or a temple or a
place.  However, if the article talks of dams or about river narmada,
mention of "maheshwar" could be a reference to the Maheshwar dam!  Likewise,
a reference to Ms.Roy could be a reference to Arundhati Roy if earlier
in the article, there are references to Arundhati Roy.  Context-based
qualifications attempt to capture these scenarios.
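
As a rough illustration of how such a specification could be read and its
AND/OR rules evaluated, here is a Python sketch (NewsRack's own parser is in
Java and is not shown here).  The `<profile>` wrapper element, the function
names, and the assumption that AND binds tighter than OR with no parentheses
or negation are all hypothetical.

```python
import xml.etree.ElementTree as ET

# The <profile> wrapper is an assumption, added so the fragment has a
# single root element; the inner elements follow the example above.
SPEC = """<profile>
  <concept>
    <name> dam </name>
    <keyword> dam </keyword>
    <keyword> reservoir </keyword>
    <keyword> mega-dam </keyword>
  </concept>
  <concept>
    <name> narmada </name>
    <keyword> narmada </keyword>
  </concept>
  <category>
    <name> narmada dam </name>
    <rule> narmada AND dam </rule>
  </category>
</profile>"""


def load_spec(xml_text):
    """Return ({concept: [keywords]}, {category: rule string})."""
    root = ET.fromstring(xml_text)
    concepts = {c.findtext("name").strip():
                [k.text.strip() for k in c.findall("keyword")]
                for c in root.findall("concept")}
    categories = {c.findtext("name").strip(): c.findtext("rule").strip()
                  for c in root.findall("category")}
    return concepts, categories


def rule_matches(rule, found):
    """Evaluate a flat rule: an OR of AND-groups over recognized concepts."""
    return any(all(term.strip() in found for term in group.split(" AND "))
               for group in rule.split(" OR "))


concepts, categories = load_spec(SPEC)
hit = rule_matches(categories["narmada dam"], {"narmada", "dam"})
miss = rule_matches(categories["narmada dam"], {"dam"})
```

Here `found` stands for the set of concepts recognized in an article; how
that set is produced is a separate step.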

4.4 Implementation of news filtering
------------------------------------
For every issue that the user has defined, NewsRack examines all the
filtering rules, and collects all concepts that have been used in the
profile.  NewsRack then generates a lexical analyzer (or scanner) to
recognize the keywords for each concept that has been used.

NewsRack generates a scanner by generating a scanner specification file
for JFlex, a publicly-available Java-based scanner generator.  NewsRack
also supports JavaCC, another publicly-available Java-based scanner
generator.  However, experimentation shows that JFlex-generated scanners
are faster and more compact than JavaCC-generated scanners.

When a news article is passed through this lexical analyzer, all keywords
that are encountered trigger the corresponding concepts to be recognized.
By analyzing all concepts that are recognized and their frequency, the
news article is then assigned to one or more categories based on the
filtering rules that match.  At this time, the concept analysis and
rule matching algorithm is somewhat rudimentary and it can be refined
and extended over time.
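
The scan-count-classify pipeline described above can be approximated in a few
lines.  This Python sketch uses a single regular expression as a stand-in for
the generated JFlex scanner, and it represents each rule as a plain list of
concepts that must all be present; both are simplifying assumptions, and all
function names are hypothetical.

```python
import re
from collections import Counter

concepts = {
    "dam": ["dam", "reservoir", "mega-dam"],
    "narmada": ["narmada"],
    "ssp": ["sardar sarovar", "ssp", "ssnnl"],
}
# Simplification: each category lists the concepts that must all occur.
rules = {"narmada dam": ["narmada", "dam"], "sardar-sarovar-dam": ["ssp"]}


def build_scanner(concepts):
    """Compile one pattern that recognizes every keyword of every concept."""
    keyword_to_concept = {k: name for name, kws in concepts.items() for k in kws}
    # Longest keywords first, so "mega-dam" is preferred over "dam".
    ordered = sorted(keyword_to_concept, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in ordered) + r")\b",
        re.IGNORECASE)
    return pattern, keyword_to_concept


def count_concepts(text, scanner):
    """Count how often each concept is triggered by a keyword in text."""
    pattern, keyword_to_concept = scanner
    return Counter(keyword_to_concept[m.group(1).lower()]
                   for m in pattern.finditer(text))


def classify(text, scanner, rules):
    """Return the categories whose required concepts all occur in text."""
    counts = count_concepts(text, scanner)
    return sorted(cat for cat, needed in rules.items()
                  if all(counts[c] > 0 for c in needed))


scanner = build_scanner(concepts)
sentence = "The Narmada reservoir feeds the Sardar Sarovar dam."
counts = count_concepts(sentence, scanner)
assigned = classify(sentence, scanner, rules)
```

The per-concept counts are kept because frequency could later feed a more
refined matching step; this sketch only checks for presence.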

4.5 Support for RSS feeds
-------------------------
Currently, I am using the publicly available RSS4J Java API for parsing
RSS news feeds.  This has been downloaded from SourceForge.  Over this,
NewsRack implements caching to prevent downloading the same article
repeatedly for different users, and across different sessions.
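
The caching idea can be sketched as follows, in Python rather than Java.
`ArticleCache` and `fake_fetch` are hypothetical names, and the in-memory
dictionary is an assumption; sharing across sessions would additionally need
the cache to be persisted, which this sketch does not do.

```python
class ArticleCache:
    """Download each URL at most once and reuse the result afterwards."""

    def __init__(self, fetch):
        self._fetch = fetch   # function: url -> article text
        self._store = {}

    def get(self, url):
        if url not in self._store:
            self._store[url] = self._fetch(url)
        return self._store[url]


calls = []


def fake_fetch(url):
    """Stand-in for a real download; records every call it receives."""
    calls.append(url)
    return "content of " + url


cache = ArticleCache(fake_fetch)
first = cache.get("http://example.org/news/1")   # first user: real download
second = cache.get("http://example.org/news/1")  # second user: from the cache
```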

---------------------
5. Ongoing challenges
---------------------
When I started this project, I was not familiar with XML, Java Servlets,
Struts, MySQL, JDBC, or with Java-based web applications.  So, quite a bit
of the time in the first 6 months has been spent getting acquainted with
these technologies, experimenting with them, and proceeding with the
development.

5.1 Using XML
-------------
Some of my lack of experience shows in the XML specification for defining
concepts, rules, news sources, and profiles.  For example, the concepts
defined earlier could become less verbose by using attributes as follows:
            <concept name="dam">
               <keyword val="dam" />
               <keyword val="reservoir" />
               <keyword val="mega-dam" />
            </concept>
While the verbosity of the current specification is not a drawback (I was
told by novice XML users that the attribute-less specification is actually
simpler), future extensions could provide support for these less-verbose
specifications.

In the future, as the system stabilizes, I will migrate the format to these
less-verbose specifications.  In any case, most users are likely to be using
the web-interface to specify filtering rules.
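
Such a migration could be mechanical.  The following Python sketch converts
one concept from the current element-based form to the attribute-based form;
`to_attribute_form` is a hypothetical helper, not part of NewsRack.

```python
import xml.etree.ElementTree as ET


def to_attribute_form(concept_xml):
    """Rewrite <name>/<keyword> child text as name= and val= attributes."""
    old = ET.fromstring(concept_xml)
    new = ET.Element("concept", name=old.findtext("name").strip())
    for keyword in old.findall("keyword"):
        ET.SubElement(new, "keyword", val=keyword.text.strip())
    return ET.tostring(new, encoding="unicode")


verbose = ("<concept><name> dam </name><keyword> dam </keyword>"
           "<keyword> reservoir </keyword></concept>")
compact = to_attribute_form(verbose)
```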

5.2 Developing a web application
--------------------------------
Initially I started using Servlets and Webmacro to implement the user
interface.  However, I later switched to using Struts to develop
the user interface.  This decision has helped me develop the initial
system much more quickly than would have been possible otherwise.  But,
the user interface development has proved to be much more difficult and
involved than I had imagined when I first began the project.

While I have implemented the back end news archiving as a simple XML
database, I might have to switch to MySQL (or other databases) at a later
point.  At that time, I will familiarize myself with JDBC on a need-to-know
basis.

5.3 Supporting non-RSS based news sources
-----------------------------------------
While supporting non-RSS sources seems as simple as downloading
all content for that day from a newspaper's website, things are more
challenging than this.  If I want to save on download bandwidth, I will
have to do more selective downloading.  But, more importantly, in order
to integrate downloaded news within NewsRack, I have to extract date,
title, author information for each clipping.  While doing this extraction
for any one particular newspaper is a simple matter of writing rules
to recognize these patterns, the harder question is whether there is a
general way of extracting this for all newspapers, or whether custom
patterns would have to be developed each time a new non-RSS news source
is added.  It is in this respect that RSS acquires added importance.  All
this information is readily available in an RSS feed.  In addition,
well-developed RSS
feeds can also provide brief abstracts of the news items which can prove
invaluable in browsing the news archive.
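
The "custom patterns per newspaper" option might look like the sketch below.
It is written in Python, and everything in it is hypothetical: the source
name `example-daily`, the HTML class names, and the `extract_metadata`
function are invented to show the shape of the approach, not taken from any
real newspaper site.

```python
import re

# One set of hand-written patterns per news source (all made up here).
SOURCE_PATTERNS = {
    "example-daily": {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
        "author": re.compile(r'<span class="byline">By (.*?)</span>', re.S),
        "date": re.compile(r'<span class="date">(.*?)</span>', re.S),
    },
}


def extract_metadata(source, html):
    """Apply the patterns registered for source; missing fields become None."""
    fields = {}
    for name, pattern in SOURCE_PATTERNS[source].items():
        match = pattern.search(html)
        fields[name] = match.group(1).strip() if match else None
    return fields


page = ('<h1 class="headline">Dam height raised</h1>'
        '<span class="byline">By A. Reporter</span>'
        '<span class="date">29 April 2005</span>')
meta = extract_metadata("example-daily", page)
```

Every new source needs its own entry in `SOURCE_PATTERNS`, and an entry
breaks whenever that site changes its layout, which is the maintenance cost
the paragraph above is concerned about.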

5.4 Managing the news archives
------------------------------
Once the tool is up and running for a few weeks, the news collection for
any particular category might continue to grow.  At that time, the challenge
will be in terms of presenting these news items to the user in a way that
does not overwhelm him.  Furthermore, support might have to be provided to
refine the classification system, and reclassify on the fly.

5.5 Bandwidth requirement for downloading news
----------------------------------------------
There are a couple of problems with the current monolithic version of
NewsRack.  Firstly, very few installations will be possible because of
the bandwidth required to download newspaper content every day.  For example,
with 10 newspapers, the monthly download is likely to be of the order of
1GB.  It is very likely that only very well-funded
organizations or organizations/individuals in the US or other developed
countries could afford the necessary bandwidth.  Public installations
as a web service (as envisaged currently) can help address this problem.
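
The estimate above can be worked through explicitly.  In this Python sketch
the 30-day month, the reading of "1GB" as 1024 MB, and the figure of 5
separate installations are assumptions made only for the arithmetic.

```python
def monthly_download_mb(newspapers, mb_per_paper_per_day, days=30):
    """Total monthly download for one installation, in MB."""
    return newspapers * mb_per_paper_per_day * days


# Working backwards from the estimate: 10 newspapers, about 1GB per month.
per_paper_per_day = 1024 / (10 * 30)   # roughly 3.4 MB per newspaper per day

# One shared installation downloads each newspaper once, however many users
# it serves; separate installations each repeat the full download.
shared_total = monthly_download_mb(10, per_paper_per_day)
separate_total = 5 * monthly_download_mb(10, per_paper_per_day)
```

Under these assumptions, five independent installations would together
download about 5GB a month, against about 1GB for one shared service.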

5.6 Challenges with copyright issues
------------------------------------
There can be potential problems with creating local copies of news
clippings without getting permission from newspapers.  This issue has not
yet been investigated, but, individual users can handle the scenario
as follows: (i) either by demonstrating the not-for-profit nature of
their work, and/or (ii) by keeping the local archives of news articles
private and local.  I do not yet foresee this to be a major problem, so
I am not going to be expending effort researching this issue.

----------------------
6. Further development
----------------------
Currently, a preliminary version has been deployed at
http://floss.sarai.net/newsrack.  As it stands, guests can browse news
archives of other users.  For registered users, news is automatically
downloaded, filtered, and archived in categories as specified by the user.

This section talks about further development of NewsRack and timeline.

6.1 Supporting non-RSS sources
------------------------------
Once the system stabilizes, I will work on providing support for non-RSS
news sources.  At that time, the tool itself will acquire a semblance of
completeness in terms of covering most of the English-language Indian
newspapers.  The other most important feature that needs to be supported
is the ability to search the news archive.

I will also explore the Google APIs to see if they can help the process.
Other than this, there are other news crawling services -- I will check
if there are tools that make this easy.

6.2 Improving the user interface
--------------------------------
The current user interface is rudimentary.  In addition, there is no easy,
form-based input of filtering rules.  I have received feedback that such
form-based input will go a long way in making it easy for non-technical
users to use NewsRack.  In light of this, I will work on improving the
user interface so that:
. profiles can be developed using web-based forms without knowledge of XML.
. profiles can be edited online without having to edit XML files.
. news, categories, and issues are presented better, so that it is easier
  to make sense of and manage the news archive.

6.3 Other desirable features
----------------------------
Beyond this, there are a number of desirable features that will make the
tool very useful.  I list them here but will not elaborate on them in great
detail.
 1. refine/debug the filtering rules -- this is the ability to understand
    why an article got classified or not classified into a particular
    category, so that filtering rules can be improved.
 2. download and classify news from regional language newspapers.
 3. generate an RSS feed of daily updates
 4. send out email notification alerts with prominent headlines
    since previous day (related to RSS feed feature)
 5. attach annotations, notes to individual news items.  this can
    be very useful when working on reports or newsletters.
 6. select news from the archive and automatically generate a
    newsletter based on the selected news clippings.
 7. reclassify an existing news archive whenever a profile is
    modified.
 8. edit the archives -- remove an item from a category, move/copy
    an item from one category to another ..
 9. sort news items on other axes (like date, author, news source)
10. decentralize news downloading across multiple installations
    of NewsRack.  I have ideas about how several installations of NewsRack
    at multiple sites could collectively download all required news content
    such that any individual installation only downloads a fraction of the
    entire content.  When the tool reaches this stage, it might begin to
    resemble a peer-to-peer model of news downloading.
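
Item 10 does not say how the work would be divided.  Purely as one possible
reading, and not necessarily the scheme intended above, the Python sketch
below assigns each news source to exactly one installation by hashing its
URL.  The function names, site names, and feed URLs are hypothetical.

```python
import hashlib


def owner(source_url, installations):
    """Pick the installation responsible for downloading source_url."""
    digest = hashlib.sha1(source_url.encode("utf-8")).hexdigest()
    return installations[int(digest, 16) % len(installations)]


def partition(sources, installations):
    """Map every installation to the list of sources it should download."""
    plan = {site: [] for site in installations}
    for url in sources:
        plan[owner(url, installations)].append(url)
    return plan


sources = ["http://example.org/feed-%d.xml" % i for i in range(6)]
sites = ["site-a", "site-b", "site-c"]
plan = partition(sources, sites)
```

Every installation computes the same plan independently, so no coordinator is
needed.  The drawback of simple modulo hashing is that most assignments change
when an installation joins or leaves.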

-----------
7. Timeline
-----------
It was mid-March of 2004 that I first thought of this problem and put
together a quick proposal for support from Sarai.  It has now been 11
months since I first started.  I have gone through a process of discussion
and visits to documentation centers, conceptualization, preliminary
design and development, deployment, and user feedback since then.  I am
doing this amidst several other things.  If a regular IT company were to
take up this project and work on it full time, a 2-person team would have
come to the stage where I am currently in about 3-4 months' time.
Given that, I think I have made good progress so far.

7.1 Priorities
--------------
Going forward, I will prioritize the areas I am going to focus on.  Of all
the desirable features listed in 6.3, features 1, 3, 7, 8 take top priority.
Feature 10 is the lowest priority.

This is in addition to continuing to work on the user interface,
documentation, and classification algorithms -- all of which are ongoing
processes right through the development.

7.2 Enlisting developer help
----------------------------
I have already registered the project on SourceForge.  However, I have not
yet made any initial releases, nor have I actively enlisted developers to
undertake development with me.  I hope to do both of these in the coming
few months.

---------------
8. Deliverables
---------------
First off, my experience of these last 11 months has shown me that I had
underestimated the time requirement of getting this far.  I had to pick up
the ropes in several areas, and got stuck at some places.  I envisage a
similar process this time around.  Having said that, I think it is safe
to say that at the end of the next 6-month period, NewsRack will be more
user-friendly and should provide at least the following in addition to
its current capabilities (with bug fixes):
 . some form of form-based specification of filtering rules
 . some form of collaborative building of filtering rules
 . output RSS feeds for classified news
 . some form of news-archive management
We will also have a fairly good understanding of how well NewsRack works,
and what directions should be pursued going forward.

In addition, I am also hoping to work with a few interested organizations and
individuals to showcase the utility of NewsRack, while being fully aware of
its limitations.