THE ETHICS OF DATABASES

Joseph A Goguen
Meaning and Computation Lab
Dept. Computer Science & Engineering
University of California, San Diego
email: jgoguen@ucsd.edu

Notes from a talk given 6 December 1999 at the Annenberg Center of the University of Southern California, and 29 October 1999, at the 1999 Annual Meeting of the Society for Social Studies of Science, San Diego, California. A separate abstract is available.

1. INTRODUCTION

Databases are becoming increasingly important in many areas, including scientific research, technological development, law enforcement, commerce, and government. As an integral part of this process, databases are evolving and mutating in ways that may seem increasingly strange and even threatening. For example, the following trends are easily documented, and have been much discussed in the computer science and/or popular literatures:

increasing size, tending towards the truly vast;
increasing sophistication and convenience of access mechanisms, tending towards analysis and away from mere query;
increasing invisibility, by absorption into the application and/or the user interface;
increasing circulation and sharing of information;
increasing coordination with other databases and applications;
increasing amounts of personal information;
increasing commercialization of information; and
improving security for database owners -- but not for ordinary consumers, citizens, etc.

It should be evident that there are ethical implications to the ways in which the information in databases is gathered and used, and that the combination of the eight trends listed above is a cause for serious ethical concern. Although these trends are confirmed by our analysis of the data of this study, they are not our main concern. Instead we focus on an issue that may be less obvious but more fundamental, that ethical values are embedded in the ways that databases are structured, and more specifically, in the ways that database access is structured. This supports analyses and arguments that are more empirical and more refined than those that are usually found in the literature on ethics.

In fact, there are significant methodological difficulties with the analysis of databases. Here we take the view that a database must not be separated from its use, where "use" includes not only the entire social context, but also the user interface, which embodies the modes of access that are allowed to the database. In addition, we take the view that the database-with-interface is a "text," to which methods like those of literary analysis and semiotics can be applied. However, these are peculiar texts, since they are dynamic, interactive, multimedia hypertexts. Most methodological issues, including some problematic metaphysical assumptions of traditional semiotics, and a suggested resolution based on techniques from ethnomethodology and theoretical computer science, are discussed in Section 4.

2. DATA

This study examines the values that can be imputed to well known web "search engines." Despite this name, and despite the illusion they create that the actual web is being searched for you in real time, they are actually searching a very large database of URLs (Uniform Resource Locators, which point to resources on the web) and of keywords; the main difference from an ordinary database is the nice GUI (graphical user interface) that they provide. This database is updated very slowly in comparison with the rate of change of the web, and moreover, despite their large size, search engine databases typically cover only a relatively small part of the web.
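The point that a query consults a stored database of URLs and keywords, rather than the live web, can be made concrete with a minimal sketch. The Python below is an illustration only: the URLs, the keyword sets, and the names `build_index` and `search` are hypothetical, and nothing here describes how Yahoo or AltaVista were actually implemented.

```python
from collections import defaultdict

# Hypothetical crawl snapshot: URL -> keywords recorded when the page was last visited.
SNAPSHOT = {
    "http://example.org/wordlist": {"aardvark", "fungicidal", "witless"},
    "http://example.org/galileo-bio": {"galileo", "astronomy", "pisa"},
    "http://example.org/galileo-probe": {"galileo", "jupiter", "nasa"},
}


def build_index(snapshot):
    """Invert the snapshot into a mapping keyword -> set of URLs."""
    index = defaultdict(set)
    for url, keywords in snapshot.items():
        for word in keywords:
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs whose stored keywords include every query word.

    Only the stored index is consulted, so pages created or changed
    after the snapshot are invisible. Word order in the query is ignored.
    """
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


INDEX = build_index(SNAPSHOT)
print(search(INDEX, "witless fungicidal"))
print(search(INDEX, "fungicidal witless"))
print(search(INDEX, "Galileo"))
```

In this toy version a page that is missing from the snapshot can never be returned, however relevant it is, which is one way of seeing why coverage and update rate matter to what users are shown.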

We illustrate our data, and later on our methods of analysis, with three webpages generated by Yahoo and AltaVista on 26 October 1999, in response to two simple queries; of course, the pages that result from these queries would be different today, but this in no way affects our conclusions (and anyway, any real data necessarily exists in its own specific context, including its time). Although one of the queries is a bit atypical, it does provide data that is useful for our study. The pairs of windows B,C and D,E each cover one webpage between them, while window A is a whole page.

Warning: These are large postscript files that could take quite some time to download (3139476 bytes).

A much larger number of webpages generated in a similar way were also examined in a manner similar to that described below, but they are not described in detail here.

3. ANALYSIS

Our initial research goal is to determine the structure of access to information that is provided by web search engines, from the user's perspective (we are not concerned with the structure of the internal implementation).

3.1 Low Level Analysis

In Window A, the query is "witless fungicidal," so it may be surprising that the search engine (which is Yahoo in this case) finds anything at all. Besides the usual banner advertisements and self-serving links, it presents us with 3 items. The one real item is a very long list of words for use by hackers to guess passwords! Along with much else (it begins with "aardvark"), this list includes the two words of our query (note that the search does not require the words to appear in the order in which they were given in the query). Apparently nothing else on the web contains both words, but we know this is false, because an earlier draft of this document was already on the web at that time. [Footnote added 6 June 2001: Amusingly, searches conducted today with Yahoo, Google, and some other engines did turn up this paper, but the hacker's password file had vanished.]

Also in Window A is a right sidebar from amazon.com, offering to sell us a book on "Witless Fungi"! (By the way, there is no clever grammar program here that understands how to get "fungi" (the plural of "fungus") from "fungicidal," it just happens that length limitations cut the word at exactly the right place.) The third item is another sidebar, below that of amazon.com, asking us to sign up for ATT Long Distance, which has nothing at all to do with our query. Notice that these sidebars are highlighted by their contrastive yellow background (plus their use of other distinctive colors), since the main text area is black and blue on a gray background (the red in Window A results from a user preference - i.e., it was set by me - for indicating URLs that have been opened recently).

Now let's look at Window B, for which the query was just "Galileo" and the search engine was again Yahoo. This is a real query, made in preparing a class on the sociology of science and technology that I teach (CSE 275); I wanted to get basic biographical information about the scientist, Galileo Galilei, including his dates of birth and death. The way I centered this window leaves out the banner ads and the self-serving links. This query should have turned up a lot of URLs, but surprisingly, there are only 114 matches, plus the ever-present yellow amazon.com sidebar (but nothing from ATT).

It is interesting to notice how this webpage is structured. Yahoo first gives us 5 "category matches," which are occurrences of the keyword "Galileo" in its high level hierarchical category structure. Of these, 2 are commercial enterprises. The next large area contains the 114 "site matches," only the first 15 of which are on this page (which continues in Window C). This page (across the two windows) contains 5 commercial items (plus a totally unrelated banner ad at the bottom). Quite rightly in terms of average user interest, NASA's Galileo space probe has the most links, but it is strange that it comes fourth, with an esoteric Stanford University project on gravity waves being first. Only 4 URLs concern the scientist Galileo; 3 are in the third group of items, after the Stanford project and a software product, and a travel agency listed under "entertainment." The bottom of this page has a clickable rectangle that leads to a page containing the next 20 items. This same structuring mechanism continues over 6 more pages, so that it is pretty tedious to reach items near the end. In total, neglecting the banner ads and sidebars, 7 out of 20 items in the two lists on this page are commercial, which is 35% (a figure that is typical of such searches).

Window D shows how the AltaVista search engine answers the same "Galileo" query. Again I have left out some annoying material from the top of the page. In sharp contrast to Yahoo, AltaVista found 395,700 items. Again there is an amazon.com box in the same contrastive color scheme, but this time it does not promise books on the specific topic of the query, nor is it the first sidebar item; that honor goes to TD Waterhouse, a stock brokerage. The first ten hits are on this page, which continues in Window E. At the rate of 10 items per page, there are a total of 39,570 pages. Links to the first 20 of these pages are given at the bottom of the first page (see Window E); at this rate, the minimum number of clicks required to get to the last item is 1,979! So this is exceedingly unlikely ever to happen.
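The arithmetic behind these figures can be checked with a short Python sketch. It assumes, as the text does, that each click can advance at most 20 result pages; the variable and function names are ours.

```python
RESULTS = 395_700       # items AltaVista reported for the "Galileo" query
ITEMS_PER_PAGE = 10     # hits shown on each result page
LINKS_PER_PAGE = 20     # page links offered at the bottom of a result page


def ceil_div(a, b):
    """Integer division rounded up, avoiding floating point."""
    return (a + b - 1) // b


pages = ceil_div(RESULTS, ITEMS_PER_PAGE)
# Assumption: one click moves forward by at most LINKS_PER_PAGE pages.
clicks = ceil_div(pages, LINKS_PER_PAGE)

print(pages)   # 39570
print(clicks)  # 1979
```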

This time, the Galileo space probe gets first place, a web design company gets second place, and the scientist gets third place (the space probe also gets fourth place). It is nice that the URLs are explicitly shown, along with the date last modified, the file size, and a language translation option. Of these first 10 URLs, 4 are commercial sites (counting a Galileo come-on from a portal), and there are no duplicates on this page (though some items are closely related).

It should be noted that search engine databases contain a significant number of dead links (i.e., URLs that point to no longer existing webpages), estimated at between 2% and 5%, due to the slow URL update rate, the huge size and rapid growth rate of the web, and the consequently high "link rot" on the web. In fact, item 5 on Window E refers to one such link.

To capture some aspects of the way that an ideal search engine should process the "witless fungicidal" query using the standard query language SQL (the most widely used of all traditional database query languages) requires some fairly tricky coding, which is shown and partially explained in the Appendix. It took a highly competent computer scientist (Kai Lin, a graduate student at UCSD), who was already fairly familiar with SQL, several hours to write and debug this code. This is interesting as a contrast to the ease of use of the search engine GUIs, as discussed further in the next subsection.

3.2 Higher Level Analysis

Based on the above relatively low level observations, we can begin to draw some higher level conclusions. These would have to be considered very tentative if they were really based on only 3 webpages, but the same patterns have been consistently observed in numerous other pages produced by these and other search engines, in response to these and many other queries.

First, it seems clear that monetary interests are distorting the view that users get of what is on the web. This can be seen especially clearly in the very special placement and coloration used for the amazon.com link, but it is also clear in the fact that consumer oriented items tend to be higher on the lists than other items, as reflected in the higher percentage of commercial URLs in the earliest parts of these lists. In fact, it is well known that businesses pay search engine owners to boost their visibility, and in particular, that amazon.com pays Yahoo and AltaVista for each hit onto its own website that comes from their sites. The same holds to a lesser extent for many other commercial enterprises, which pay to have their placement artificially raised in site lists. However, our goal in this paper is to deduce such distortions from textual and contextual evidence.

In comparison with traditional database query mechanisms like SQL, the web-based GUIs of the search engines take far better account of the needs and capabilities of ordinary users. The fact that one can click on an item and then (almost) immediately see that item (if it is available) is very convenient, as is (in principle) the fact that items are prioritized and structured in advance by the engine, rather than by the user. Limiting the number of items on each page to 10 or 20 is helpful, as is the categorization system employed by Yahoo (though walking through such a system to frame a query is by comparison rather awkward). The use of simple layout and color conventions is good, though of course it is annoying to users that the most prominent items are often of little relevance; however, this too is a significant part of the design, used to highlight the revenue-producing items.

Because of the fierce competition among search engines, there is a strong pressure to give users what they want, which is first of all the particular piece of information they seek, and secondly the easiest possible way of accessing that information.

Regarding the first goal, which we may call "populism," search engines keep track of user queries, and some of them use that information to raise the priorities of the most popular sites. In addition, some engines also try to count the number of links to pages on the web from other pages, and then use that as a measure of their popularity, again for setting priorities. We saw, however, that this goal is not always very well met, with commercial interests being the chief distorting factor, though there were some other unexplained anomalies (e.g., the top rating for gravity waves).

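The link-counting form of "populism" can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of any real engine: the link graph is invented, and the names `inbound_counts` and `rank_by_popularity` are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a.example": ["probe.example", "bio.example"],
    "b.example": ["probe.example"],
    "c.example": ["probe.example", "shop.example"],
    "probe.example": ["bio.example"],
}


def inbound_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):      # a linking page counts once
            if target != source:         # self-links are ignored
                counts[target] += 1
    return counts


def rank_by_popularity(candidates, links):
    """Order candidate pages by inbound link count, most linked first.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = inbound_counts(links)
    return sorted(candidates, key=lambda url: (-counts[url], url))


RANKED = rank_by_popularity(["shop.example", "bio.example", "probe.example"], LINKS)
print(RANKED)
```

Any departure of the list actually displayed from an ordering of this kind is then something that calls for explanation, which is the sense in which commercial placement counts as a distortion here.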
Regarding the second goal, it is interesting to compare Yahoo and AltaVista. Looking first at the information that each provides, we can see that AltaVista assumes more sophistication of its users than does Yahoo (this is also suggested by their names). For example, Yahoo gave only a very small number of items for the Galileo query, and presented only rather simple information about those items. The Yahoo query mechanism is also much more limiting, and Yahoo treats the amazon.com box in a way that is less suitable for sophisticated users.

Compared to the sophisticated GUI access mechanisms of web search engines, traditional database query languages like SQL embody a staunchly modernist value system that is highly unsuitable for this domain, because these search engines are intended to be used by ordinary people who may have very little experience with or knowledge of computers. Here "modernist" refers to the assumption of a well-defined monolithic world that can be taken for granted, and that in particular is well understood by users.

(This paragraph is just to tie up a loose end: It could be argued that many users of search engines in fact do want to buy something. But looking a little closer at the businesses involved in our data deflates that argument. How many users really want to find a travel agency in Ireland, or buy a professional Java database development tool, or purchase help from an internet business consulting firm in Ontario, Canada? I am taking it for granted that the answer is "very few." Of course, the search engine owners could give us a much more precise answer, but they have good reasons for not wanting to do so.)

4. THEORY

The analyses in Section 3 sought to reveal values by techniques broadly similar to those used in the study of literature, e.g., determining what is important by examining placement, color, size, participation in repetitive patterns, ease of access, participation in contrastive patterns, etc. Despite the dynamic, interactive, multimedia hypertext nature of our texts, this proved to be fairly straightforward, and in many cases our conclusions could also be confirmed independently by information that is generally available within the computer science and web business communities.

Still, one would like to have a better sense of just what is the nature of this kind of analysis, in order to better understand its strengths and limitations, and to apply it more effectively in practice. Every analysis must contend with incompleteness in its data, both text and context, and every analysis must deal with the question of how its inferences can be justified on the basis of its incomplete information. This requires a deeper look at theories that might support such analyses, the most obvious of which is semiotics. Unfortunately, traditional semiotics, as expounded by Peirce, Eco, Saussure and others, comes with a heavy metaphysical load, generally taking a realist, Platonist view, according to which signs and their meanings actually exist in some real world (such as Plato's alleged realm of pure ideas). Furthermore, none of the various semiotic theories known to this author are precise enough to deal effectively with the enormous precision and detail that is typical of computer-based information artifacts, and that is required for engineering applications.

One foundational approach that does seem suitable for studies of the kind undertaken here is the author's "social theory of information," which is based on a combination of semiotics and ethnomethodology. In brief, this theory says that

an item of information is defined socially, by reference to some group to which it is important, and semiotically, through the system of differences in which it participates (these differences are of course also socially defined).

A similar notion was called a category system by Harvey Sacks in ethnomethodology. Extra precision is available through the author's algebraic semiotics, which combines social semiotics with algebraic specification. Although such precision is hardly needed for present purposes, having it in the background does mean that the technical terms and arguments in our study can be made much more rigorous than is usual, and can potentially be automated to a certain extent. Moreover, the more precise conceptual framework also provides very useful guidelines for structuring an inquiry such as the present one, and in particular, the notion of semiotic morphism provides a suggestive unity to several of the observations about search engine database access made in this study, since they can be formulated in terms of the extent to which certain semiotic structures are, or are not, preserved as they are mapped onto the GUI. For an example of the use of this terminology, our findings include that the ordering of websites by user popularity is not preserved by the semiotic morphism that places URLs on the website lists that users see, nor is the simple list structure preserved by the colorful advertising sidebars.
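As a rough illustration of what it means for an ordering to be, or not to be, preserved by a mapping onto the display, consider the Python sketch below. It is a toy reading of the idea and not the formalism of algebraic semiotics; the two orderings are invented for the example, and the function names are hypothetical.

```python
from itertools import combinations


def order_violations(source_order, displayed_order):
    """Return the pairs (x, y) where x precedes y in source_order
    but x follows y in displayed_order.

    Items that do not appear in both lists are ignored.
    """
    position = {item: i for i, item in enumerate(displayed_order)}
    shared = [item for item in source_order if item in position]
    return [(x, y) for x, y in combinations(shared, 2)
            if position[x] > position[y]]


def preserves_order(source_order, displayed_order):
    """True when the display keeps every pair in its source order."""
    return not order_violations(source_order, displayed_order)


# Hypothetical orderings: by assumed user popularity, and as displayed.
POPULARITY = ["space probe", "scientist", "software product", "travel agency"]
DISPLAYED = ["software product", "space probe", "travel agency", "scientist"]

print(preserves_order(POPULARITY, DISPLAYED))
for pair in order_violations(POPULARITY, DISPLAYED):
    print(pair)
```

Each reported pair is a place where the display reverses the source ordering, so the list of violations gives a crude measure of how far the mapping is from preserving that structure.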

Although the analytic methods employed here can be formalized in the mathematical language of algebraic semiotics, this does not mean that they can be divorced from the context of their actual socially situated use, as is often assumed in mathematicized studies of social phenomena, presumably due to an implicit Platonist philosophy on the part of many (most?) formalizers.

5. A RELATED STUDY

We focus on some results that are complementary to those discussed here, from an important case study by Bowker and Star [1]. This paper examines the struggle of nurses to adapt to a healthcare environment dominated by HMOs with their heavy emphasis on accountancy. The paper is centrally concerned with classifications and standards, especially the Nursing Intervention Classification, an attempt by nurses to develop a new classification system for their activities that better represents what they see as important, such as comforting patients.

Many facets of an organization's life are reflected in its databases, including the non-representation of certain features. Databases are significant for the Bowker & Star study, because they mediate so much of modern healthcare, and because there is a tendency to regard only what is explicitly represented in a database as "real," with the rest being implicit, or in the terminology of Bowker and Star, relegated to an invisible infrastructure. Even if some hospital administrators believe that something not represented in a database is important, they will have great difficulty in justifying this belief unless relevant figures can be included in the reports that are read by administrators higher up the "chain of translations." However these reports are generated by application programs that run off the hospital databases. Because classification schemes are necessary for entering information into a database, classifications and standards became a key area for the debate between the nurses and the administrators.
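The mechanism by which unclassified work drops out of sight can be sketched as follows. Everything in this Python fragment is hypothetical: the two-entry classification scheme, its codes, and the shift log are invented for illustration and are not taken from the Nursing Intervention Classification or from the Bowker and Star study.

```python
from collections import Counter

# Hypothetical classification scheme: code -> activity name.
SCHEME = {"MED": "administer medication", "VIT": "record vital signs"}

# Hypothetical shift log: (activity code or None, minutes spent).
SHIFT_LOG = [
    ("MED", 15),
    ("VIT", 10),
    (None, 25),   # e.g. comforting a patient: the scheme has no code for it
    ("MED", 5),
]


def enter_into_database(log, scheme):
    """Store only the entries the scheme can classify.

    Returns the stored rows and the minutes that could not be entered.
    """
    stored = [(code, minutes) for code, minutes in log if code in scheme]
    dropped = sum(minutes for code, minutes in log if code not in scheme)
    return stored, dropped


def report(stored, scheme):
    """Total minutes per classified activity, as a report program would see them."""
    totals = Counter()
    for code, minutes in stored:
        totals[scheme[code]] += minutes
    return dict(totals)


STORED, DROPPED = enter_into_database(SHIFT_LOG, SCHEME)
REPORT = report(STORED, SCHEME)
print(REPORT)
print(DROPPED)
```

The report is computed from the stored rows alone, so the 25 unclassified minutes cannot appear in any figure passed further up the chain, whatever anyone believes about their importance.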

There is also a tendency for database intensive organizations to resist change, and hence to freeze a status quo, because it can be very difficult to change the structure of a large database once it has been deployed. The difficulty of changing database structure also implies that any associated classification schemes and standards will be difficult to change. Moreover, mechanistic and reductionist views of organizational operation tend to be reinforced, because they are easier to implement in a database and its application programs for analysis and reporting than would be the more "holistic," "reflexive," or "ecological" views that the nurses tend to favor; for the same reason, quantification is reinforced. We can consider all these to be values of the underlying database technology.

This discussion shows that the conflicting values that originally seemed to involve only the nurses and administrators, also involve the values embodied in the infrastructural hospital database systems in a crucial way. We can now paraphrase the story told by Bowker and Star as saying that the hospital databases are allied with the administration, but that the nurses are actively seeking to recruit them, by trying to compromise their own values with those of the databases (and the administrators). (Of course, the terminology that I use here owes a great debt to the work of Bruno Latour.)

6. CONCLUSIONS, DISCUSSIONS, AND EXTENSIONS

We found that monetary values play a very significant role in structuring the modes of access to web search engine databases, distorting how users see the web by promoting commercial interests. This is part of the trend towards the increased commercialization of information that was mentioned in Section 1. In fact, this case study found evidence for each of the eight points mentioned in Section 1 (although we do not address the security and privacy issues in this paper). We also found that the values of users are reflected, both through the "populist" ethic (partially) used for URL selection, and through the account taken of the biological, cognitive and social natures of users in designing the GUI, e.g., in choices of color and size in layout, as well as in the ordering of URLs in site lists. This case study can also be considered a suggestive example of the passage from (relatively less situated) "data" to (relatively more situated) "information" (or some might prefer to say, from "information" to "knowledge"), illustrating some of the ways that values and social context enter into such a process.

Our analysis demonstrated that the apparently unproblematic notion of "searching a database" (and in particular, of "searching the web") is heavily value-laden, involving socially situated economic and cultural interests, as well as technical factors. More significantly, we found that although there is a conflict between the value systems of users and search engine owners, success for both depends on achieving an adequate compromise between their value systems. We also noted a similar pattern in a study by Bowker & Star of the representation of nursing interventions. Moreover, both studies show that not only the database users and owners, but also the databases themselves have values, for which translations and compromises must also be effected in order for the overall system to succeed. Another interesting compromise is that between a search engine database and the web itself: the web is too huge to be captured by the database, and too full of junk for the database to want to capture all of it; so ways must be found to store feasible amounts of relatively high quality information. Moreover, the web is changing too quickly for a database to have an accurate representation of it. Such patterns of translations and compromises are also important in requirements engineering, a discipline that is concerned with reconciling the social and technical factors of systems, in order to determine the properties that the system must have in order to succeed [2].

At the methodological level, although we found that methods similar to those of literary analysis and ethnography were sufficient for our analyses, we also claimed that these methods could be grounded in semiotics, and thereby made more rigorous. The paper also discussed the problematic metaphysical assumptions of standard semiotics, and sketched an approach that avoids them, by using a socially grounded semiotics, based on a social theory of information that draws on ideas from ethnomethodology as well as from theoretical computer science. Actually, the more rigorous methods of social algebraic semiotics were used by the author in doing this study, in the form of precise guidelines that can still be applied in a relatively informal way.

This paper is part of a larger project to examine the "natural ethics" of information artifacts, exploring the hypothesis that specific but implicit values are embodied in the designs of such artifacts, and that these values, as well as those of their users and designers, can be uncovered by semiotic techniques analogous to those of literary analysis, and then subjected to further analysis. Future research will consider information artifacts other than web search engines. Among other goals, we hope to raise awareness of the new possibilities for manipulation that are inherent in current information technology, and of the necessity for compromise among value systems in order to successfully deploy real systems, as well as to explore the ways in which such compromises are achieved, and how they can be studied.

REFERENCES

[1] Geoffrey Bowker and Susan Leigh Star, How things (actor-net)work: Classification, magic and the ubiquity of standards (technical report, University of Illinois at Urbana-Champaign, 1998).
[2] Joseph Goguen, "Requirements Engineering as the Reconciliation of Technical and Social Issues," in Requirements Engineering: Social and Technical Issues, edited with Marina Jirotka. Academic Press, 1994, pages 165-199.
[3] Joseph Goguen, "Towards a Social, Ethical Theory of Information," in Social Science, Technical Systems and Cooperative Work: Beyond the Great Divide, edited by Geoffrey Bowker, Leigh Star, William Turner and Les Gasser. Erlbaum, 1997, pages 27-56. www.cs.ucsd.edu/users/goguen/ps/sti.ps.gz.
[4] Joseph Goguen, "An Introduction to Algebraic Semiotics, with Applications to User Interface Design," in Computation for Metaphors, Analogy and Agents, edited by Chrystopher Nehaniv. Springer Verlag, Lecture Notes in Artificial Intelligence, Volume 1562, 1999, pages 242-291. www.cs.ucsd.edu/users/goguen/ps/as.ps.gz.
[5] Joseph Goguen, CSE 275, Fall 1999, Fall 2000, Spring 2001. www.cs.ucsd.edu/users/goguen/courses/275/.
[6] Joseph Goguen, Semiotic Morphisms, an online tutorial, October 1996, with later edits. www.cs.ucsd.edu/users/goguen/papers/sm/smm.html.
[7] ITKnowledge.com, online books on SQL. www.itknowledge.com/reference/dir.programminglanguages.sql1.html.
[8] Bruno Latour, Aramis, or the Love of Technology. Harvard, 1996.
[9] Harvey Sacks, "On the Analyzability of Stories by Children," in Directions in Sociolinguistics, edited by John Gumperz and Dell Hymes. Holt, Rinehart and Winston, 1972, pages 325-345.
[10] Harvey Sacks, Lectures on Conversation, edited by Gail Jefferson. Blackwell, 1992.

APPENDIX: THE SQL CODE

A query written in (ANSI standard) SQL to capture part of how an ideal search engine should process the "witless fungicidal" query is given below. First preference is given to pages that contain both words in the given order; second preference to pages that contain both words in the opposite order; third preference to pages that contain just the first word; and last preference to pages that contain just the second word. This code does not attempt to take account of observed patterns of user preference, only the preferences implicit in the structure of the original query. The code was written by Kai Lin, to whom the author offers his thanks.

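  -- "pages" is a placeholder table name (TABLE itself is a reserved word).
  -- INSTR(s, t, 1, 1) is the position of the first occurrence of t in s, or 0.
  SELECT entities, 1 AS preference FROM pages   -- both words, given order
    WHERE INSTR(entities,'witless',1,1) > 0 AND
          INSTR(entities,'witless',1,1) < INSTR(entities,'fungicidal',1,1)
  UNION ALL
  SELECT entities, 2 AS preference FROM pages   -- both words, opposite order
    WHERE INSTR(entities,'fungicidal',1,1) > 0 AND
          INSTR(entities,'fungicidal',1,1) < INSTR(entities,'witless',1,1)
  UNION ALL
  SELECT entities, 3 AS preference FROM pages   -- first word only
    WHERE INSTR(entities,'witless',1,1) > 0 AND
          INSTR(entities,'fungicidal',1,1) = 0
  UNION ALL
  SELECT entities, 4 AS preference FROM pages   -- second word only
    WHERE INSTR(entities,'fungicidal',1,1) > 0 AND
          INSTR(entities,'witless',1,1) = 0
  ORDER BY preference, entities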
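For readers who wish to experiment, the four-level preference ordering described in this appendix can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions and not the code discussed in the paper: it uses SQLite's two-argument instr function, it expresses the preferences with a single CASE expression instead of a union of four queries, and the table name `pages` and its sample rows are invented for the example.

```python
import sqlite3

# Hypothetical page texts, one per row.
PAGES = [
    ("a fungicidal spray for witless gardeners",),   # both words, opposite order
    ("witless remarks about fungicidal sprays",),    # both words, given order
    ("a witless aardvark",),                         # first word only
    ("fungicidal treatments",),                      # second word only
    ("nothing relevant here",),                      # neither word
]

# SQLite's instr(s, t) returns the 1-based position of t in s, or 0 if absent.
QUERY = """
SELECT entities,
       CASE
         WHEN instr(entities, 'witless') > 0
              AND instr(entities, 'fungicidal') > 0
              AND instr(entities, 'witless') < instr(entities, 'fungicidal') THEN 1
         WHEN instr(entities, 'witless') > 0
              AND instr(entities, 'fungicidal') > 0 THEN 2
         WHEN instr(entities, 'witless') > 0 THEN 3
         ELSE 4
       END AS preference
FROM pages
WHERE instr(entities, 'witless') > 0 OR instr(entities, 'fungicidal') > 0
ORDER BY preference, entities
"""


def ranked_pages(rows):
    """Load rows into an in-memory table and return (text, preference) pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE pages (entities TEXT)")
        conn.executemany("INSERT INTO pages VALUES (?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


RANKED = ranked_pages(PAGES)
for text, preference in RANKED:
    print(preference, text)
```

Even in this small form the query is far from what an ordinary user could be expected to write, which is the contrast with the search engine GUIs drawn in Section 3.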

Note to readers of hardcopy versions of this paper: Some citations are "in-lined" as hypertext links in the online version of this paper, at www.cs.ucsd.edu/users/goguen/papers/4s/4s.html.

Maintained by Joseph Goguen
Draft completed 6 December 1999; edited 6 June 2001