Will Hill, Larry Stead, Mark Rosenstein and George Furnas
Bellcore, 445 South Street, Morristown, NJ 07962-1910
lstead@bellcore.com, gwf@bellcore.com
Computer Graphics and Interactive Media Group
© ACM
With vast stores of multimedia events and objects to choose from, future users of the national information infrastructure will be overwhelmed with choices, and human-computer interface designers will be called upon to address the problem. The aim of this research is to evaluate the power of a particular form of virtual community to help users find things they will like with minimal search effort. Taking video selection as an initial test domain, the technique compares a viewer's personal ratings of videos with those of hundreds of others to find people with similar preferences, and then recommends unseen videos that these similar people have viewed and liked. The technique far outperforms a standard source of movie recommendations: nationally recognized movie critics.
Keywords: Human-computer interaction, interaction history, computer-supported cooperative work, organizational computing, browsing, set-top interfaces, resource discovery, video on demand.

Introduction
The term community means "a group of people who share
characteristics and interact". The term virtual means "in
essence or effect only". Thus, by virtual community we
mean "a group of people who share characteristics and
interact in essence or effect only". In other words, people in
a Virtual Community influence each other as though they
interacted but they do not interact. Thus we ask: "Is it possible to arrange for people to share some of the personalized informational benefits of community involvement without the associated communications costs?" Such costs might include, for example, the time costs of developing a personal relationship, costs to privacy, and the costs of synchronous face-to-face communication.
Virtual community, not virtual reality nor intelligent agents
We wish to contrast our idea of virtual community with two
popular themes in human interface work: virtual reality and
intelligent agents. First we draw the contrast with virtual
reality.
Popular future visions of networked computing and infrastructure marry perceptual immersion in virtual reality to high-bandwidth telecommunications. They seek a photorealistic and real-time "cyber-face to cyber-face" social environment [10]. This immersive vision expects total involvement from participants. The result is what might be called a virtual reality community, with its central issues of visual, auditory and temporal fidelity. By virtual community we do not mean virtual reality community. The pitfalls of seeking higher and higher fidelity to face-to-face communication have been well discussed in Brothers et al. [2]. Virtual community is about attempting to realize some of the benefits of community without the associated communications costs.
A second popular vision of networked computing and infrastructure paints scenarios that include a large role for "intelligent agents": semi-autonomous programs somehow endowed with intelligence great enough to impress us with their ability to interpret our needs and to work on our behalf. Our notion of virtual community includes no central role for intelligent agents other than the human participants in the virtual community.
Relation of current work to previous research
Malone et al. [7] propose three types of information filtering activities: cognitive, economic and social. Cognitive activities filter information based on content. Economic activities filter information based on estimated search costs and benefits of use. Social activities filter information based on individual judgments of quality communicated through personal relationships. This paper concentrates upon the computer-assisted mediation of Malone's third type: social filtering activities. However, a basic thesis of this work is that personal relationships are not necessary to social filtering. In fact, social filtering and personal relationships can be teased apart and put back together in interesting new ways. For instance, the communication of quality judgments can occur through less personal, and even impersonal, relationships as well as personal ones. Obviously, people want a satisfying mix of both personal and impersonal relationships.
We have been particularly interested in how social filtering activities can be simultaneously streamlined and enriched through the careful design of communication media. The social relationships in which filtering of information occurs can be streamlined by making them less personal and enriched by making them more personal. For example, adding or removing the communications costs of synchronous face-to-face encounter, offering anonymity, and choosing a more personal medium such as voice or a less personal medium such as text are all means of influencing the personal aspects of communication. Social filtering can be simultaneously streamlined and enriched by making some aspects of a relationship less personal while making other aspects more personal.
In the realm of computer-assisted mediation of social filtering, a few HCI experiments sparsely dot the space of possible designs. Goldberg's Tapestry system [3] is a site-oriented email system encouraging the entry of free-text annotations with which on-site users can later filter messages. Annotations are rich in high-quality information and their successful uses are valuable. However, despite hopes to the contrary, the twin tasks of writing annotations to enter filtering data and specifying queries to use filtering data require significant user effort. Domains where the invested efforts pay off readily are few, but they do exist. In the case of annotations, where the method of entering filtering information for the benefit of others has significant user costs, Grudin's question [4] "Who does the work and who gets the benefit?" becomes noticeably relevant.
Reacting against the trend of interface designers loading additional tasks on users in order to help them find things, the history-enriched digital objects (HEDO) approach [5][6][11] attempts to explore a region of the interface design space that minimizes additional user tasks. Through a combination of automatic interaction history and graphics, depictions of communal history within interface objects hint at their use while user effort is minimized. HEDO techniques record the statistics of menu selections, the count of spreadsheet cell recalculations, and the time spent reading documents (e.g., email, reports, source code) in a line-by-line manner, summing over sections and whole documents. Displays are simple shadings on menus, spreadsheets and document scroll bars. Because the HEDO data are less informative than annotations, they tend to be less useful, but they cost less to gather and use. There is evidently a trade-off here.
One way to think about the trade-off is to consider the two approaches to social filtering mentioned so far as two ends of a spectrum. On one end of the spectrum we have social filtering interfaces that expect more work from the user and give more value. On the other end we have interfaces that expect no additional work from the user but provide less value. Our thought is that somewhere in the middle of this spectrum, between the two end alternatives, there might lie special niches that offer relatively more filtering value for relatively less filtering work. Such locations on the spectrum, if they exist, could be called design "sweet spots".
We have in mind the ideal of a community of users routinely entering personal ratings of their interest in digital objects in the simplest form possible: a single keypress or gesture. These evaluations are pooled and analyzed automatically in service of the community of use. Members of this community, at their pleasure, receive recommendations of new or unfamiliar digital objects that they are likely to find interesting.
Recommendations might, for instance, take the form of recommendation-enhanced browse-products that tattoo symbols of predicted interest upon object navigation and control points. Later on, Figure 4 shows such a Mosaic browsing interface with recommendation-enhanced hypermedia links and menus.
Of course the question is: does this kind of virtual community work? The answer, as we will show, is "yes" for videos, and probably "yes" for many other forms of consumer-level information items: books (categorized by author), video games, gaming scenarios, music, magazines and restaurants.
Concerning the use of ratings, Allen [1] reported unencouraging results in one of the first investigations (known to us) into personal ratings for HCI-type user-modeling. Recently, Resnick et al. [9] have designed a social filtering architecture based upon personal ratings and demonstrated its application to work-group filtering of Netnews. In a study of eight users reading 8000 Netnews messages, Morita and Shinoda [8] observed strong positive correlations between time spent reading messages and personal interest ratings of those messages. Their work suggests it might be possible for time-on-task measures to stand in for ratings, further reducing user tasks.
In the process of achieving our overall goal of making personal evaluations do significant interface work for a virtual community, our approach illustrates a number of supportive community-oriented design goals:
Our design also embodies two research tactics.
In order to understand the power of recommending and evaluating choices in a virtual community, we posed three basic questions:
The second and third of these questions deserve further comment. The second question is straightforward, and standard statistical methods apply for answering it. On the third question, no standard measures have emerged as a consensus. At present, we consider two measures: (1) In a split-data test, how well do item ratings predicted by the recommending/evaluating system correlate with actual ratings submitted by users? (2) How do users evaluate the results they see from the algorithms? We report on these measures in the Results section.
Our method was to seed a virtual community in the Internet and to do all the work necessary to exchange high-quality recommendations among participants. People participated (and still participate) through an email interface at videos@bellcore.com. From October 1993 through May 1994 we collected data on how the virtual community functions, how people like it, and how well it performs for participants.
The virtual community support provided at videos@bellcore.com consists of a generic object-oriented database that stores and accesses preferences efficiently and gives out recommendations and evaluations. It is generic in the sense that one can construct various domains of items: videos, restaurants, books, document pages, and places to visit. In particular, at the time of our analysis, videos@bellcore.com included a data set of 55,000+ ratings of 1750 movies by 291 users. It includes recommending algorithms whose predictions improve as the data grow, and the number of movies, users and ratings continues to grow daily.
The database is organized as a set of interrelated instances of object classes. The objects are:
The database contains 17 modules. A single high-level database interface consisting of the following functions suffices to control it in most circumstances: load-database, save-database, add-user, erase-user, add-item, erase-item, add-ratings, recommend-items, evaluate-items.
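As a rough illustration (not the actual implementation, which is not reproduced here), the core of this interface might be sketched as a small Python class; the internal representation, the class name, and the omission of the load/save and recommend/evaluate functions are our assumptions:

```python
# Minimal sketch of the high-level database interface named above.
# Internal structure (dicts keyed by user and item) is an assumption.
class PreferenceDB:
    def __init__(self):
        self.users = {}   # user id -> {item id -> rating}
        self.items = set()

    def add_user(self, user):
        self.users.setdefault(user, {})

    def erase_user(self, user):
        self.users.pop(user, None)

    def add_item(self, item):
        self.items.add(item)

    def erase_item(self, item):
        self.items.discard(item)
        for ratings in self.users.values():
            ratings.pop(item, None)

    def add_ratings(self, user, ratings):
        # Ratings are 1-10 integers or the symbolic categories
        # "must-see", "not-interested", "unseen", "pending-as-suggestion".
        self.add_user(user)
        for item, score in ratings.items():
            self.add_item(item)
            self.users[user][item] = score

db = PreferenceDB()
db.add_ratings("alice", {"Blade Runner": 9, "Alien": 8})
```

load-database/save-database and the recommend-items/evaluate-items algorithms would sit on top of this storage layer.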
Internet participants send a message containing "subject: ratings" to videos@bellcore.com. The system replies with an alphabetical list of 500 videos for the user to evaluate on a scale of 1-10 for the titles they have seen. Rating 1 is low and 10 is high. Users may also rate an unseen movie as "must-see" or "not-interested" as appropriate. Surprisingly, early usability tests showed that it was reasonable to expect self-selected Internet users to rate movies on an alphabetical list of 500 movies. However we do not expect this to be a feature of a deployed system. In order to reduce item/item bias, for every participant 250 of the 500 movies listed are selected randomly. To increase rating hits and to gather a standard set of data for purposes of fair comparison, for every participant the remaining 250 titles are a fixed set of popular movies.
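The construction of the 500-title list described above might be sketched as follows; the function name and seed handling are our assumptions, and the fixed popular set stands in for whatever list the system actually used:

```python
import random

def build_rating_list(all_titles, popular_250, seed=None):
    """Build the 500-title rating list: 250 randomly selected titles
    (to reduce item/item bias) plus a fixed set of 250 popular titles
    (to increase rating hits and allow fair comparison), alphabetized."""
    rng = random.Random(seed)
    candidates = [t for t in all_titles if t not in set(popular_250)]
    chosen = rng.sample(candidates, 250)          # sample without replacement
    return sorted(set(popular_250) | set(chosen))  # alphabetical, as mailed out
```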
When users return their movie ratings to videos@bellcore. com, an EMACS client process parses the incoming message, and passes ratings data inside a request for a recommendations-text to the server database process. The server process performs add-user, add-ratings and recommend-items. In the initial phase of adding ratings for a new user, ratings are added not only in the 1-10, "must-see" and "not-interested" categories, but also in the "unseen" category for titles that the user could have rated but did not. These unseen movies are the first pool from which to compute recommendations.
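The initial add-ratings bookkeeping described above, in which unrated listed titles become the "unseen" pool, might look like this (a sketch; the function name is ours):

```python
def categorize(listed_titles, submitted):
    """Sketch of the initial add-ratings step: merge the user's submitted
    ratings (1-10 integers, "must-see", or "not-interested") with an
    "unseen" entry for every listed title the user could have rated but
    did not. The unseen movies form the first recommendation pool."""
    cats = dict(submitted)
    for title in listed_titles:
        cats.setdefault(title, "unseen")
    return cats
```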
When a user is new, the database first looks for correlations between the new user's ratings and ratings from a random subsample of known users. We use the random subsample to limit the number of correlations computed to O(n) rather than O(n²) in the number of participants. One-tenth of the new user's ratings are held out from the analysis for later quality-testing purposes. The most similar users found are used as variables in a multiple-regression equation to predict the new user's ratings. The generated equation is then evaluated by predicting the held-out one-tenth of the new user's ratings and correlating these predictions with the actual ratings.
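The similarity step might be sketched as below. This omits the random subsampling and the multiple-regression stage, and the minimum overlap of five co-rated movies is our assumption, not a parameter reported here:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def most_similar(new_ratings, others, k=7):
    """Correlate the new user with each known user over co-rated
    movies; return the k highest-correlating (r, user) pairs."""
    scored = []
    for uid, ratings in others.items():
        common = [m for m in new_ratings if m in ratings]
        if len(common) >= 5:  # assumed minimum overlap for a usable r
            r = pearson([new_ratings[m] for m in common],
                        [ratings[m] for m in common])
            scored.append((r, uid))
    return sorted(scored, reverse=True)[:k]
```

In the full system, the ratings of these most similar viewers become the predictor variables of the regression equation.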
Once the prediction equation exists, it is quite fast to evaluate every unseen movie, sort them by highest prediction and skim off the top to recommend. When recommended, movies are marked in the database as "pending-as-suggestion". A recommendation text is generated and passed back to the EMACS front-end client process, where it is mailed back to the user or users.
The Internet email interface is currently a subject-line command interface and there are many commands for specialized actions. Further details are available by sending mail to videos@bellcore.com.
Here is a sample reply from the system. Names have been changed to protect anonymity:
Suggested Videos for: John A. Jamus.
Your must-see list with predicted ratings:
Your video categories with average ratings:
The viewing patterns of 243 viewers were consulted. Patterns of 7 viewers were found to be most similar. Correlation with target viewer:
By category, their joint ratings recommend:

Correlation of predicted ratings with your actual ratings is: 0.64. This number measures ability to evaluate movies accurately for you. 0.15 means low ability. 0.85 means very good ability. 0.50 means fair ability.
Suggested Videos for: Jane Robins, Jim Robins, together.
Your video categories with average ratings:
We have algorithms for two purposes, recommending items and evaluating items. Having tried a few versions of each, we report on the best we have discovered so far. We do not have evidence that these are the best algorithms possible, only that they are good. The algorithms we use for recommending have the following abstract functional form:
The function to return an evaluation of a proposed choice looks like this:
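The functional forms themselves do not survive in this copy of the text. As a hedged reconstruction consistent with the surrounding description, the two interfaces might look like this, where `predict_rating` stands in for the fitted regression equation:

```python
# Hedged reconstruction of the two abstract interfaces; the concrete
# algorithm (similar-viewer multiple regression) supplies predict_rating.

def evaluate(user, item, predict_rating):
    """Return a predicted 1-10 rating of `item` for `user`."""
    return predict_rating(user, item)

def recommend(user, unseen_items, predict_rating, top_n=10):
    """Score every unseen item for `user`, sort by highest predicted
    rating, and skim off the top to recommend (as described above)."""
    ranked = sorted(unseen_items,
                    key=lambda it: predict_rating(user, it),
                    reverse=True)
    return ranked[:top_n]
```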
Currently the database consists of 291 participants in the community, 55,000 ratings on a 1-to-10 scale, another 2100 "must-see" or "not-interested" ratings, 64,000 "unseen" and 1200 "pending-as-suggestion" ratings. Of the 1750 movies in the database, 1306 have at least one rating and 739 have at least 3 ratings. 208 movies have more than 100 ratings, and 2 movies have more than 200 ratings. Users rate an average of 183 movies each with a standard deviation of 99. More than 220 of 291 total participants rated more than 100 movies. The database is small, but large enough to conservatively but accurately estimate a number of performance parameters.
For the 739 movies that have three or more ratings, Figure 2 shows the distribution of movies by their mean rating. Notice the slight bias toward positive ratings.
Six weeks after they first tried videos@bellcore.com by submitting ratings and receiving recommendations, 100 early users were asked to re-rate exactly the same list of movie titles as they had rated the first time. 22 volunteers replied with a second set of ratings. Three outliers were removed from the reliability analysis since they correlated perfectly and were evidently copies of the original ratings rather than second independent sets of ratings. For the remaining 19 users, on movies rated on both occasions, the Pearson r correlation between first-time and second-time ratings six weeks apart was 0.83. This number gives a rough estimate of how reliable a source of information the ratings are.
We held out 10% of every participant's movie ratings to provide a cross-validation test of accuracy. The cross-validated correlation of predicted ratings and actual ratings estimates how well our recommendation method is working. Figure 3 shows that our current best similar-viewers algorithm correlates at 0.62 with user ratings. This is a strong positive correlation, which means the recommendations are good. How good? We may expect three out of every four recommendations to be rated very highly by a potential viewer. We compared the quality of our virtual community recommendation method to a standard method of getting recommendations: following the advice of movie critics. The ratings of movies by two nationally known movie critics were entered. Their ratings correlate much more weakly, at only the 0.22 level, with viewer ratings. Thus the virtual community method is dramatically more accurate, as Figure 3 also shows.
Email responses from videos@bellcore.com include a request for open-ended feedback. Out of 51 voluntary responses, 32 were positive, 14 negative and 5 neutral. Here are some sample quotes:
Open-ended feedback from users also indicated interest in establishing direct social contacts within their virtual community. Users can participate in either an anonymous or signed fashion. Interestingly, only four users exercised the anonymity option. Wishing to extend the social possibilities of the virtual community, two users asked if they could set "single and available" flags in the community indicating they wanted to use the community as a means of dating. One user found a long-lost friend from junior high school. Another wrote that he took the high correlation between his movie tastes and those of someone he was dating as evidence for a long future relationship.
One of the standard uses of reliability measures is to put a bound on prediction performance. The basic idea is that since a person's rating is noisy (i.e., has a random component in addition to their underlying true feeling about the movie), it will never be possible to predict their rating perfectly. Standard statistical theory says that the best one can do is the square root of the observed test-retest reliability correlation. (This is essentially because predicting what the user said once from what they said to the same question last time has noise at both ends, squaring its effect. The correlation with the truth, if some technique could magically extract it, would have the noise in only once, and hence is bounded only by the square root of the observed reliability.) The point to note here is that the observed reliability of 0.83 means that in theory one might be able to get a technique that predicts preference with a correlation of 0.91. The performance of the techniques presented here, though much better than that of existing techniques, is still well below this ideal limit. Substantial improvements may be possible.
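The arithmetic of this bound can be checked directly:

```python
import math

reliability = 0.83                 # observed test-retest correlation
ceiling = math.sqrt(reliability)   # theoretical prediction ceiling, ~0.91
```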
We see a potential for deployment to customers of national information access who will be faced with thousands of possible choices for information and entertainment, in addition to videos.
We have instantiated a version of our server where items are World Wide Web URLs (uniform resource locators) in place of videos. Figure 4 displays a modified Mosaic browser interface that accepts ratings of WWW pages on a slider widget (near bottom) and reports them to an appropriate virtual community server. When a user clicks on the Recommend URL button (near bottom), the browser contacts the virtual community server to get recommended URLs and then fetches the recommended page. It also displays, next to every hypertext link, one-half to four stars representing the virtual community's predicted value of chasing down the hypertext link.
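A hypothetical mapping from a 1-10 predicted rating to the half-to-four-star link annotation might look like the following; the actual binning used by the Mosaic interface is not specified here:

```python
def stars(predicted, lo=1.0, hi=10.0):
    """Map a predicted rating in [lo, hi] onto 0.5-4.0 stars in
    half-star steps (an assumed binning, for illustration only)."""
    frac = (predicted - lo) / (hi - lo)               # normalize to 0..1
    halves = max(1, min(8, round(0.5 + frac * 7.5)))  # 1..8 half-stars
    return halves / 2.0                               # 0.5 .. 4.0 stars
```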
One direction in which we plan to push the research is toward more individual and social aspects. In particular we are interested in distributed peer-to-peer versions rather than the centralized client/server version that we have now. A wireless deployment of a peer-to-peer version could include wearable PCS devices, pairs of which will, when in close physical proximity, exchange ratings data for local virtual community computation. See Community and History-of-Use Navigation Home Page for further information.