Notes from a talk given 6 December 1999 at the Annenberg Center of the University of Southern California, and 29 October 1999, at the 1999 Annual Meeting of the Society for Social Studies of Science, San Diego, California. A separate abstract is available.
Databases are becoming increasingly important in many areas, including scientific research, technological development, law enforcement, commerce, and government. As an integral part of this process, databases are evolving and mutating in ways that may seem increasingly strange and even threatening. For example, the following trends are easily documented, and have been much discussed in the computer science and/or popular literatures:
In fact, there are significant methodological difficulties with the analysis of databases. Here we take the view that a database must not be separated from its use, where "use" includes not only the entire social context, but also the user interface, which embodies the modes of access that are allowed to the database. In addition, we take the view that the database-with-interface is a "text," to which methods like those of literary analysis and semiotics can be applied. However, these are peculiar texts, since they are dynamic, interactive, multimedia hypertexts. Most methodological issues, including some problematic metaphysical assumptions of traditional semiotics, and a suggested resolution based on techniques from ethnomethodology and theoretical computer science, are discussed in Section 4.
We illustrate our data, and later on our methods of analysis, with three webpages generated by Yahoo and AltaVista on 26 October 1999, in response to two simple queries; of course, the pages that result from these queries would be different today, but this in no way affects our conclusions (and anyway, any real data necessarily exists in its own specific context, including its time). Although one of the queries is a bit atypical, it does provide data that is useful for our study. The pairs of windows B,C and D,E each cover one webpage between them, while window A is a whole page:
Warning: These are large PostScript files that could take quite some time to download (3139476 bytes).
A much larger number of webpages generated in a similar way were also examined in a manner similar to that described below, but they are not described in detail here.
In Window A, the query is "witless fungicidal," so it may be surprising that the search engine (which is Yahoo in this case) finds anything at all. Besides the usual banner advertisements and self-serving links, it presents us with 3 items. The one real item is a very long list of words for use by hackers to guess passwords! Along with much else (it begins with "aardvark"), this list includes the two words of our query (note that the search does not require that the words are in the same order that they were given in the query); apparently nothing else on the web contains both words (but we know this is false, because (an earlier draft of) this document was already on the web at that time [footnote added 6 June 2001: Amusingly, searches conducted today with Yahoo, Google, and some other engines did turn up this paper, but the hacker's password file had vanished.]).
Also in Window A is a right sidebar from amazon.com, offering to sell us a book on "Witless Fungi"! (By the way, there is no clever grammar program here that understands how to get "fungi" (the plural of "fungus") from "fungicidal," it just happens that length limitations cut the word at exactly the right place.) The third item is another sidebar, below that of amazon.com, asking us to sign up for ATT Long Distance, which has nothing at all to do with our query. Notice that these sidebars are highlighted by their contrastive yellow background (plus their use of other distinctive colors), since the main text area is black and blue on a gray background (the red in Window A results from a user preference - i.e., it was set by me - for indicating URLs that have been opened recently).
Now let's look at Window B, for which the query was just "Galileo" and the search engine was again Yahoo. This is a real query, made in preparing a class on the sociology of science and technology that I teach (CSE 275); I wanted to get basic biographical information about the scientist, Galileo Galilei, including his dates of birth and death. The way I centered this window leaves out the banner ads and the self-serving links. This query should have turned up a lot of URLs, but surprisingly, there are only 114 matches, plus the ever-present yellow amazon.com sidebar (but nothing from ATT).
It is interesting to notice how this webpage is structured. Yahoo first gives us 5 "category matches," which are occurrences of the keyword "Galileo" in its high level hierarchical category structure. Of these, 2 are commercial enterprises. The next large area contains the 114 "site matches," only the first 15 of which are on this page (which continues in Window C). This page (across the two windows) contains 5 commercial items (plus a totally unrelated banner ad at the bottom). Quite rightly in terms of average user interest, NASA's Galileo space probe has the most links, but it is strange that it comes fourth, with an esoteric Stanford University project on gravity waves being first. Only 4 URLs concern the scientist Galileo; 3 are in the third group of items, after the Stanford project and a software product, and a travel agency listed under "entertainment." The bottom of this page has a clickable rectangle that leads to a page containing the next 20 items. This same structuring mechanism continues over 6 more pages, so that it is pretty tedious to reach items near the end. In total, neglecting the banner ads and sidebars, 7 out of 20 items in the two lists on this page are commercial, which is 35% (a figure that is typical of such searches).
Window D shows how the AltaVista search engine answers the same "Galileo" query. Again I have left out some annoying material from the top of the page. In sharp contrast to Yahoo, AltaVista found 395,700 items. Again there is an amazon.com box in the same contrastive color scheme, but this time it does not promise books on the specific topic of the query, nor is it the first sidebar item; that honor goes to TD Waterhouse, a stock brokerage. The first ten hits are on this page, which continues in Window E. At the rate of 10 items per page, there are a total of 39,570 pages. Links to the first 20 of these pages are given at the bottom of the first page (see Window E); at this rate, the minimum number of clicks required to get to the last item is 1,979! So this is exceedingly unlikely ever to happen.
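The page and click counts above can be checked with a few lines of arithmetic. This is only a sketch under the simplifying assumption used in the text: every results page holds 10 items, and each click can advance at most 20 pages, because only 20 page links are offered at a time. The function names are illustrative and do not come from either search engine.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-numerator // denominator)


def pages_needed(total_hits: int, hits_per_page: int) -> int:
    """Number of result pages required to list every hit."""
    return ceil_div(total_hits, hits_per_page)


def min_clicks_to_last_page(total_pages: int, page_links_per_page: int) -> int:
    """Lower bound on clicks if one click advances at most page_links_per_page pages."""
    return ceil_div(total_pages, page_links_per_page)


# Figures reported for the AltaVista "Galileo" query of 26 October 1999.
total_pages = pages_needed(395_700, 10)
clicks = min_clicks_to_last_page(total_pages, 20)
print(total_pages, clicks)  # 39570 1979
```

The ceiling matters here: 39,570 pages divided by 20 links per page is 1,978.5, so a final partial jump is needed, giving 1,979.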
This time, the Galileo space probe gets first place, a web design company gets second place, and the scientist gets third place (the space probe also gets fourth place). It is nice that the URLs are explicitly shown, along with the date last modified, the file size, and a language translation option. Of these first 10 URLs, 4 are commercial sites (counting a Galileo come-on from a portal), and there are no duplicates on this page (though some items are closely related).
It should be noted that search engine databases contain a significant number of dead links (i.e., URLs that point to no longer existing webpages), estimated at between 2% and 5%, due to the slow URL update rate, the huge size and rapid growth rate of the web, and the consequently high "link rot" on the web. In fact, item 5 on Window E refers to one such link.
To capture some aspects of the way that an ideal search engine should process the "witless fungicidal" query using the standard query language SQL requires some fairly tricky coding, which is shown and partially explained in Appendix A (SQL is the most widely used of all traditional database query languages); it took a highly competent computer scientist (Kai Lin, a graduate student at UCSD) who was already fairly familiar with SQL several hours to write and debug this code. This is interesting as a contrast to the ease of use of the search engine GUIs, as discussed further in the next subsection.
First, it seems clear that monetary interests are distorting the view that users get of what is on the web. This can be seen especially clearly in the very special placement and coloration used for the amazon.com link, but it is also clear in the fact that consumer oriented items tend to be higher on the lists than other items, as reflected in the higher percentage of commercial URLs in the earliest parts of these lists. In fact, it is well known that businesses pay search engine owners to boost their visibility, and in particular, that amazon.com pays Yahoo and AltaVista for each hit onto its own website that comes from their sites. The same holds to a lesser extent for many other commercial enterprises, which pay (to a lesser extent) to have their placement artificially raised in site lists. However, our goal in this paper is to deduce such distortions from textual and contextual evidence.
In comparison with traditional database query mechanisms like SQL, the web-based GUIs of the search engines take far better account of the needs and capabilities of ordinary users. The fact that one can click on an item and then (almost) immediately see that item (if it is available) is very convenient, as is (in principle) the fact that items are prioritized and structured in advance by the engine, rather than by the user. Limiting the number of items on each page to 10 or 20 is helpful, as is the categorization system employed by Yahoo (though walking through such a system to frame a query is by comparison rather awkward). The use of simple layout and color conventions is good, though of course it is annoying to users that the most prominent items are often of little relevance; however, this too is a significant part of the design, used to highlight the revenue-producing items.
Because of the fierce competition among search engines, there is a strong pressure to give users what they want, which is first of all the particular piece of information they seek, and secondly the easiest possible way of accessing that information.
(This paragraph is just to tie up a loose end: It could be argued that many users of search engines in fact do want to buy something. But looking a little closer at the businesses involved in our data deflates that argument. How many users really want to find a travel agency in Ireland, or buy a professional Java database development tool, or purchase help from an internet business consulting firm in Ontario, Canada? I am taking it for granted that the answer is "very few." Of course, the search engine owners could give us a much more precise answer, but they have good reasons for not wanting to do so.)
Still, one would like to have a better sense of just what is the nature of this kind of analysis, in order to better understand its strengths and limitations, and to apply it more effectively in practice. Every analysis must contend with incompleteness in its data, both text and context, and every analysis must deal with the question of how its inferences can be justified on the basis of its incomplete information. This requires a deeper look at theories that might support such analyses, the most obvious of which is semiotics. Unfortunately, traditional semiotics, as expounded by Peirce, Eco, Saussure and others, comes with a heavy metaphysical load, generally taking a realist, Platonist view, according to which signs and their meanings actually exist in some real world (such as Plato's alleged realm of pure ideas). Furthermore, none of the various semiotic theories known to this author are precise enough to deal effectively with the enormous precision and detail that is typical of computer-based information artifacts, and that is required for engineering applications.
One foundational approach that does seem suitable for studies of the kind undertaken here is the author's "social theory of information," which is based on a combination of semiotics and ethnomethodology. In brief, this theory says:
an item of information is defined socially, by reference to some group to which it is important, and semiotically, through the system of differences in which it participates (these differences are of course also socially defined).
A similar notion was called a category system by Harvey Sacks in ethnomethodology. Extra precision is available through the author's algebraic semiotics, which combines social semiotics with algebraic specification. Although such precision is hardly needed for present purposes, having it in the background does mean that the technical terms and arguments in our study can be made much more rigorous than is usual, and can potentially be automated to a certain extent. Moreover, the more precise conceptual framework also provides very useful guidelines for structuring an inquiry such as the present one, and in particular, the notion of semiotic morphism provides a suggestive unity to several of the observations about search engine database access made in this study, since they can be formulated in terms of the extent to which certain semiotic structures are, or are not, preserved as they are mapped onto the GUI. For an example of the use of this terminology, our findings include that the ordering of websites by user popularity is not preserved by the semiotic morphism that places URLs on the website lists that users see, nor is the simple list structure preserved by the colorful advertising sidebars.
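To make the idea of a structure being "preserved" or "not preserved" concrete, here is a deliberately small sketch. It is not algebraic semiotics itself, which treats sign systems and their mappings far more generally; it only checks one property, whether a displayed list keeps the relative order given by some source ranking. The function name, the item labels, and the popularity ranks are all hypothetical and are not measured data.

```python
from itertools import combinations


def preserves_order(source_rank: dict, displayed: list) -> bool:
    """Return True if every pair of displayed items keeps the relative order
    it has in source_rank (a lower rank number means earlier).
    Assumes the ranks are distinct."""
    position = {item: index for index, item in enumerate(displayed)}
    return all(
        (source_rank[a] < source_rank[b]) == (position[a] < position[b])
        for a, b in combinations(displayed, 2)
    )


# Hypothetical popularity ranks (1 = most popular); illustration only.
popularity = {"space probe": 1, "scientist": 2, "software product": 3}

as_ranked = ["space probe", "scientist", "software product"]
as_displayed = ["software product", "space probe", "scientist"]

print(preserves_order(popularity, as_ranked))     # True
print(preserves_order(popularity, as_displayed))  # False
```

In this toy reading, a display that promotes a paid item above more popular ones is a mapping that fails to preserve the popularity ordering.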
Although the analytic methods employed here can be formalized in the mathematical language of algebraic semiotics, this does not mean that they can be divorced from the context of their actual socially situated use, as is often assumed in mathematicized studies of social phenomena, presumably due to an implicit Platonist philosophy on the part of many (most?) formalizers.
Many facets of an organization's life are reflected in its databases, including the non-representation of certain features. Databases are significant for the Bowker & Star study, because they mediate so much of modern healthcare, and because there is a tendency to regard only what is explicitly represented in a database as "real," with the rest being implicit, or in the terminology of Bowker and Star, relegated to an invisible infrastructure. Even if some hospital administrators believe that something not represented in a database is important, they will have great difficulty in justifying this belief unless relevant figures can be included in the reports that are read by administrators higher up the "chain of translations." However, these reports are generated by application programs that run off the hospital databases. Because classification schemes are necessary for entering information into a database, classifications and standards became a key area for the debate between the nurses and the administrators.
There is also a tendency for database intensive organizations to resist change, and hence to freeze a status quo, because it can be very difficult to change the structure of a large database once it has been deployed. The difficulty of changing database structure also implies that any associated classification schemes and standards will be difficult to change. Moreover, mechanistic and reductionist views of organizational operation tend to be reinforced, because they are easier to implement in a database and its application programs for analysis and reporting than would be the more "holistic," "reflexive," or "ecological" views that the nurses tend to favor; for the same reason, quantification is reinforced. We can consider all these to be values of the underlying database technology.
This discussion shows that the conflicting values that originally seemed to involve only the nurses and administrators, also involve the values embodied in the infrastructural hospital database systems in a crucial way. We can now paraphrase the story told by Bowker and Star as saying that the hospital databases are allied with the administration, but that the nurses are actively seeking to recruit them, by trying to compromise their own values with those of the databases (and the administrators). (Of course, the terminology that I use here owes a great debt to the work of Bruno Latour.)
Our analysis demonstrated that the apparently unproblematic notion of "searching a database" (and in particular, of "searching the web") is heavily value-laden, involving socially situated economic and cultural interests, as well as technical factors. More significantly, we found that although there is a conflict between the value systems of users and search engine owners, success for both depends on achieving an adequate compromise between their value systems. We also noted a similar pattern in a study by Bowker & Star of the representation of nursing interventions. Moreover, both studies show that not only the database users and owners, but also the databases themselves have values, for which translations and compromises must also be effected in order for the overall system to succeed. Another interesting compromise is that between a search engine database and the web itself: the web is too huge to be captured by the database, and too full of junk for the database to want to capture all of it; so ways must be found to store feasible amounts of relatively high quality information. Moreover, the web is changing too quickly for a database to have an accurate representation of it. Such patterns of translations and compromises are also important in requirements engineering, a discipline that is concerned with reconciling the social and technical factors of systems, in order to determine the properties that the system must have in order to succeed [2].
At the methodological level, although we found that methods similar to those of literary analysis and ethnography were sufficient for our analyses, we also claimed that these methods could be grounded in semiotics, and thereby made more rigorous. The paper also discussed the problematic metaphysical assumptions of standard semiotics, and sketched an approach that avoids them, by using a socially grounded semiotics, based on a social theory of information that draws on ideas from ethnomethodology as well as from theoretical computer science. Actually, the more rigorous methods of social algebraic semiotics were used by the author in doing this study, in the form of precise guidelines that can still be applied in a relatively informal way.
This paper is part of a larger project to examine the "natural ethics" of information artifacts, exploring the hypothesis that specific but implicit values are embodied in the designs of such artifacts, and that these values, as well as those of their users and designers, can be uncovered by semiotic techniques analogous to those of literary analysis, and then subjected to further analysis. Future research will consider information artifacts other than web search engines. Among other goals, we hope to raise awareness of the new possibilities for manipulation that are inherent in current information technology, and of the necessity for compromise among value systems in order to successfully deploy real systems, as well as to explore the ways in which such compromises are achieved, and how they can be studied.
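-- Results come back in four groups, in this order. "table" is a placeholder
-- for the name of the table being searched. A per-branch ORDER BY is not valid
-- inside a UNION, so a numeric priority column and one final ORDER BY are used.
-- Group 1: both words occur, "witless" before "fungicidal"
SELECT 1 AS priority, entities FROM table
WHERE INSTR(entities,'witless',1,1) > 0
  AND INSTR(entities,'witless',1,1) < INSTR(entities,'fungicidal',1,1)
UNION
-- Group 2: both words occur, "fungicidal" before "witless"
SELECT 2 AS priority, entities FROM table
WHERE INSTR(entities,'fungicidal',1,1) > 0
  AND INSTR(entities,'fungicidal',1,1) < INSTR(entities,'witless',1,1)
UNION
-- Group 3: only "witless" occurs
SELECT 3 AS priority, entities FROM table
WHERE INSTR(entities,'witless',1,1) > 0
  AND INSTR(entities,'fungicidal',1,1) = 0
UNION
-- Group 4: only "fungicidal" occurs
SELECT 4 AS priority, entities FROM table
WHERE INSTR(entities,'fungicidal',1,1) > 0
  AND INSTR(entities,'witless',1,1) = 0
ORDER BY priority, entities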
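The four-group ranking expressed by this query can be tried out on a toy table. The following is a minimal sketch, assuming Python's built-in sqlite3 module rather than the SQL system used for the original code. It therefore uses SQLite's two-argument instr in place of the four-argument INSTR, a numeric priority column with a single final ORDER BY to keep the groups in order, and a hypothetical table name docs. The sample rows are invented for illustration.

```python
import sqlite3

# Invented sample rows; the column name `entities` follows the appendix.
rows = [
    ("a witless and fungicidal remark",),
    ("fungicidal yet witless",),
    ("merely witless",),
    ("fungicidal spray",),
    ("aardvark",),
]

QUERY = """
SELECT 1 AS priority, entities FROM docs
 WHERE instr(entities, 'witless') > 0
   AND instr(entities, 'witless') < instr(entities, 'fungicidal')
UNION
SELECT 2, entities FROM docs
 WHERE instr(entities, 'fungicidal') > 0
   AND instr(entities, 'fungicidal') < instr(entities, 'witless')
UNION
SELECT 3, entities FROM docs
 WHERE instr(entities, 'witless') > 0
   AND instr(entities, 'fungicidal') = 0
UNION
SELECT 4, entities FROM docs
 WHERE instr(entities, 'fungicidal') > 0
   AND instr(entities, 'witless') = 0
ORDER BY priority, entities
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (entities TEXT)")
conn.executemany("INSERT INTO docs VALUES (?)", rows)

# Each result is (priority, text); rows containing neither word are omitted.
results = conn.execute(QUERY).fetchall()
for priority, text in results:
    print(priority, text)
```

Note that instr does plain substring matching, so a row containing "witlessly" would also match; a real search engine would tokenize words first.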
www.cs.ucsd.edu/users/goguen/papers/4s/4s.html