Possibilities and Problems of Digital History and Digital Collections

As a history student, I found this to be an utterly fascinating session. It was led by Roy Rosenzweig and Dan Cohen, both of the Center for History and New Media at George Mason University. The two presenters are co-authors of Digital History: A Guide to Gathering, Preserving and Presenting the Past on the Web.

I missed the first few minutes or so of the presentation, but I got enough out of the rest to more than make up for it. The session discussed some of the projects undertaken by the CHNM such as the Hurricane Digital Memory Bank, the 9/11 Digital Archive, Firefox Scholar and the Syllabus Finder.

They pointed out the new opportunities for historical research presented as large digital collections become available. For example, Dan talked about how he had extracted all items in the center’s September 11 digital repository that mentioned 9 AM, then geocoded them and plotted them using Google Maps. This allows viewers to pan around Lower Manhattan and click on pushpins to see and hear the experiences of people just after the planes crashed into the twin towers.

As another example, he discussed taking mentions of CNN, Fox News, the radio, and prayer and plotting them on a map of the United States using Google Earth. He also described using the massive collection of syllabi gathered through the center’s Syllabus Finder to study which books are being used in history courses in the US. (He noted that if teachers are consistently assigning outside reading in, say, African-American history, this may be an indication that textbook authors are not covering the topic adequately in core texts.)

Dan lamented the fact that the APIs that enable this type of novel research are currently being offered mostly by private companies, and he called on librarians and archivists to learn from the example and open up their digital collections for this type of quantitative research.

Dan has a blog where he commonly discusses these topics and also promised to post a primer on creating Google Earth KML files from archival datasets. Cool stuff!

I also once again took a ton of notes at this session, which you can find after the jump. They’re a great read!

SAA 2006: Possibilities and Problems of Digital History and Digital Collections – An Impressionistic Transcript.

(Note that I came in a bit late, so these notes don’t start at the beginning of the session. -DD)

Roy:

Who Built America? CD – scandal over gay cowboys! (Apple refused to distribute it unless material on homosexuality was removed, but eventually backed down.)

Biggest censorship issues now take place outside the United States. (Compare a search for Tiananmen Square on English versus Chinese Google.)

Another concern stems more from consumption than production. Can amateurs compete with corporations for attention on the web? Googling “History” brings you to the History Channel first.

Issues related to this are not just about free beer. There are the other freedoms that Stallman has outlined. Eventually we will want more than just read-only access to historical content. Some things, like our Syllabus Finder, rely on our ability to access Google’s API. We need APIs for digital humanities content, not to rely on corporations. Significantly, academics have generally lagged behind others in embracing open access, populist enterprises, etc. Tended to view things like Wikipedia with disdain. Nor have humanities scholars shown much interest in making scholarship freely available.

Historians have no unique vision of the future – they have enough trouble with the past. But we do need to take action to build open sources – should be what academic and popular historians are doing. Join with archivists, librarians, others to promote information openness, access, etc.

Dan:

We are thinking about what we need to do in the present to provide future historians with what they will need to document our history. We have to be considerably more proactive in the digital era. A growing part of our lives is lived online in digital form. There are opportunities to document human experience that we have the potential to capture. But digital is fragile – we have examples in the book of info that has gone away. We need to think now about how we will archive that stuff.

I’m cavalierly using the word archives. To avoid rotten tomatoes from the audience, I’ll say that we’re collecting. But I think our collections have some archival aspects.

Why collect history online? Just think about Sept. 11. Think about what happened. Hundreds of thousands of people wrote email, posted something to their blog, or blackberried each other. There are tremendous resources here that historians will want in 50 or 100 years. Not just paper diaries, which are declining.

But websites change, people move on. The NYT online changed minute by minute. What should historians look at? If “Dewey Defeats Truman” had been online, they would have changed it in a few seconds and we might never have seen it.

Along with LOC, Pew, Internet Archive, etc. we were able to save tons of content. See http://911da.org

A lot of material on the site was not born digital – for example, things people posted on lamp posts that were then scanned and uploaded. But more and more stuff is going digital – even things like Skype. One of the advantages of digital media is capacity. Prices for hard drives are coming down extremely rapidly – will get better and better. We have the opportunity and possibility of saving quite a bit of material. I think the common view in the NARA guide to archival science is that the number one thing is to decide what you will throw out. At the top of the hierarchy is the Declaration of Independence, and at the bottom is a Post-It. We have a friend who discovered a major historical find in a book of WWII rumors – the Wikipedia of the day. With the capacity we have, we can save a lot more.

Herodotus, first western historian, covered it all, talked to both Greeks and Persians. (Quote: “make mention of both alike.”) We don’t know what’s going to be important to whom, or when – we can’t say which part of the web will be more important to the future.

This is not just an ethical decision, it’s a practical one, and I think we can do it. One quick example – when I look back at our 9/11 site, we had people who came there to study teen slang, because we have a huge collection of writings from teens. I couldn’t have anticipated that in advance.

Project began with the ECHO project, funded by the Sloan Foundation. Gathering recollections in digital files on the history of science, technology, and industry. http://echo.gmu.edu. Working on this about 5 years ago – about a year into it when 9/11 happened. After debating the propriety of collecting 9/11 materials, we launched the archive based on some of the technologies developed for ECHO. This exceeded our expectations – we ended up collecting 150,000 digital objects. Includes everything – digital audio, blackberry communications, etc. In 2003, LoC took the collection as its first major digital accession and it is now a test case on preserving digital files.

We also have a dark archive of materials that won’t be released for 25 or 50 years. But the site still includes photos from around the country and around the world. Includes Photoshop art and other digital artwork that people did. We have scanned images that have been sent to us. More than 50,000 email addresses. We’re going to release a researcher’s interface in a few weeks that allows you to get deeply into the collection and study it as a scholar.

In coming years, these large, openly available digital collections will be common and usable in new ways: they have manipulability – the ability to manipulate these collections. An initial example – we took photos and stories that had 9 AM in a text string, found their locations, and mapped them onto a Google map. It shows what was going on a few minutes after what happened. Allows you to pop up photos and text on the map showing what happened at that time. This is the kind of opportunity of mixing things together that I’m talking about. http://911da.org/maps/ground_zero.php
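A minimal sketch of that kind of extraction, in Python. The record layout, field names, and sample stories are invented for illustration; nothing here is the archive’s actual schema or code. It keeps only the stories whose text mentions 9 AM in one of a few common spellings.

```python
import re

# Hypothetical records: the "id" and "text" fields are assumptions,
# not the 9/11 Digital Archive's real schema.
stories = [
    {"id": 1, "text": "At 9 AM I was on Church Street when I looked up."},
    {"id": 2, "text": "We heard the news around noon at school."},
    {"id": 3, "text": "It was about 9:00 a.m. and the office went silent."},
]

# Matches "9 AM", "9am", "9 a.m." and "9:00 AM".
# The lookbehind rejects "19 AM"; "9:30 AM" is rejected because only ":00" is allowed.
NINE_AM = re.compile(r"(?<!\d)9(?::00)?\s*a\.?m\b", re.IGNORECASE)


def mentions_nine_am(text):
    """Return True if the text contains a 9 AM time string."""
    return NINE_AM.search(text) is not None


matches = [s["id"] for s in stories if mentions_nine_am(s["text"])]
print(matches)
```

A plain substring search would be simpler, but it would miss spellings like "9 a.m." and would also pick up "19 AM", so a small pattern seems a reasonable middle ground.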

Our next thinking was about the needs of digital scholarship – what do scholars want? We realized we should have had better geolocation built into the system instead of having to extract that.

Similarly:

Hurricane Digital Memory Bank
http://hurricanearchive.org/map_browse.php

If we had just asked people to send stories without taking into account where they were, future scholars would have fewer hooks to grab onto. So we asked people to pinpoint where they were before they contributed anything. One of the interesting things is that you can now automatically get latitude and longitude from addresses. So then we can have them upload multiple files about that location. So you can build up a collection this way.

One of the things I did for this talk – wanted to push boundaries. So I did some extractions of the 9/11 collection onto Google Earth. What I’ve done here is looked for stories – 4 slices of our collection: people who were watching CNN, people watching Fox News (which had just started), people who prayed, and people who listened to the radio. So what would a historian do? For example, people who were watching CNN – how did they react, what were they doing, where were they? I will make KML files available on the blog and write up how to do it.
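As a rough illustration of what producing such a KML file can involve, here is a sketch in Python using only the standard library. It is not Dan’s method. The records, their field names, and the coordinates are made up, and the namespace used is the later OGC KML 2.2 one, which is an assumption about what a current reader would target. Each record becomes one Placemark with a name, a description, and a Point.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"  # assumption: OGC KML 2.2 namespace


def build_kml(records):
    """Return a KML document (as a string) with one Placemark per record."""
    ET.register_namespace("", KML_NS)  # serialize without a namespace prefix
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for rec in records:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = rec["title"]
        ET.SubElement(placemark, f"{{{KML_NS}}}description").text = rec["text"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML writes coordinates as longitude,latitude (in that order).
        coords = f"{rec['lon']},{rec['lat']}"
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = coords
    return ET.tostring(kml, encoding="unicode")


# Hypothetical slice of stories; titles, text, and coordinates are invented.
records = [
    {"title": "Watching CNN", "text": "We had CNN on in the break room.",
     "lat": 40.7115, "lon": -74.0134},
    {"title": "Listening to the radio", "text": "I heard it in the car.",
     "lat": 38.8462, "lon": -77.3064},
]

kml_text = build_kml(records)
print(kml_text)
```

Saving the returned string to a file with a .kml extension should give something Google Earth can open, with one pushpin per story.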

Can also combine maps with viewing digital files. So, for example, can see what artwork was like around Washington vs. New York. Meshing together a variety of sources. Can see a scholar in the future who has all these data resources and can move seamlessly between them.

Not just map based analysis. What can happen with open text archives?

For example, the Syllabus Finder allows you to aggregate sources scattered across the web. I did a study where I downloaded 800 intro US history survey course syllabi and looked for what textbooks they assigned, grading system, other books, etc. using text analysis techniques. Because these resources are available, you can study the teaching of American history in colleges. Things I discovered: for example, heavy reliance on multiple choice. Found lots of additional reading on the African American experience – perhaps this means people feel textbooks don’t cover this enough. Making things available to be aggregated like this so they can be data mined is really critical and opens up new avenues.
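The notes don’t say which text analysis techniques were used, so the following is only a guess at the simplest version: count how many syllabi mention each title on a known list. The syllabus snippets and the placeholder titles are invented; a real study would need to handle abbreviations, editions, and misspellings.

```python
from collections import Counter

# Hypothetical syllabus texts and placeholder book titles (not real data).
syllabi = [
    "Required: Survey Text A. Exams are multiple choice.",
    "Required: Survey Text B. Additional reading: Supplementary Reader C.",
    "Required texts: Survey Text A and Supplementary Reader C.",
]
titles = ["Survey Text A", "Survey Text B", "Supplementary Reader C"]


def count_assignments(syllabi, titles):
    """Count the number of syllabi that mention each title at least once."""
    counts = Counter()
    for text in syllabi:
        lowered = text.lower()
        for title in titles:
            if title.lower() in lowered:
                counts[title] += 1  # one vote per syllabus, not per mention
    return counts


counts = count_assignments(syllabi, titles)
for title, n in counts.most_common():
    print(f"{title}: assigned in {n} of {len(syllabi)} syllabi")
```

Counting syllabi rather than raw mentions keeps one long reading list from swamping the totals.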

By nature, close reading of original source texts is an anecdotal method. There will be things in the future where you will have to do a wide variety of sources. For example, imagine historian studying Clinton White House who has to deal with 40 million e-mails. Compare that to the letter output of the Clinton White House. At 1 minute per e-mail, it would take 70 years to read all the messages. (assuming no coffee breaks.)
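The arithmetic behind that figure is easy to check. Reading around the clock at one minute per message comes out a little above the round number quoted. The working-hours variant below is an added assumption (8 hours a day, 250 days a year), not something from the talk.

```python
EMAILS = 40_000_000
MINUTES_PER_EMAIL = 1

# Reading nonstop, 24 hours a day, 365 days a year.
minutes_per_year_nonstop = 60 * 24 * 365
years_nonstop = EMAILS * MINUTES_PER_EMAIL / minutes_per_year_nonstop

# Assumed working schedule: 8 hours a day, 250 days a year.
minutes_per_year_working = 60 * 8 * 250
years_working = EMAILS * MINUTES_PER_EMAIL / minutes_per_year_working

print(round(years_nonstop, 1))  # 76.1
print(round(years_working, 1))  # 333.3
```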

Even for someone like me who does Victorian history, I think there are opportunities to be less anecdotal. For example, putting together a database of Victorian letters from scientists and looking at text patterns. Sure I’ve looked at hundreds of them, but are they really representative of hundreds of thousands of scientists?

So what does this mean for you? As Roy suggested, no longer sufficient for archives to be independent, gated silos. If you want to be fully used, you’ll need to open up and provide ways for researchers to have different entrées into your collection.

Roy talked about APIs (Application Programming Interfaces). Come from IBM in the 60s to allow other programmers access to some but not all of a technology. I think what’s sad now is that a lot of the APIs we survey are commercial. There’s a reason for that – it’s a free service. Google has a few billion left over to fund this kind of thing. But for a small archive to do this, to commit resources to allow people to bypass their website and get at data, that’s a tall order.

But standards like RSS and other simple XML standards may help. Providing search results in XML in ways that can be grabbed and mapped. We’ll think of other ways. How will you make your collections available so that they can be used in new ways?
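For a sense of how little is needed, here is a sketch that renders a list of search hits as a bare-bones RSS 2.0 feed with Python’s standard library. The function name, the result fields, and the archive.example.org URLs are all hypothetical; only the rss, channel, and item structure comes from RSS itself.

```python
import xml.etree.ElementTree as ET


def results_to_rss(query, results, site_url):
    """Render search results as a minimal RSS 2.0 feed, returned as a string."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the three required channel elements.
    ET.SubElement(channel, "title").text = f"Search results for '{query}'"
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = f"{len(results)} matching items"
    for hit in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = hit["title"]
        ET.SubElement(item, "link").text = hit["url"]
        ET.SubElement(item, "description").text = hit["summary"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical search hits from a small archive's catalog.
results = [
    {"title": "Playground committee minutes",
     "url": "https://archive.example.org/items/101",
     "summary": "Minutes of the first committee meeting."},
    {"title": "Letter on park funding",
     "url": "https://archive.example.org/items/102",
     "summary": "Correspondence with the city council."},
]

feed = results_to_rss("playground", results, "https://archive.example.org")
print(feed)
```

Anything that already reads RSS could then pull the same results a visitor would see on the search page, without scraping the site.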

Finally, I think resources that are free to use are more valuable than those that are gated, even if they are limited. You may sneer at Wikipedia – Stephen Colbert did a good job of that this week. But it’s valuable. Even if it’s run by cranks – we’ve actually downloaded the whole text to our hard drive and used it for research. For example, even if it’s wrong, I can use it to pull out relevant words on, say, evolution, and use them to analyze syllabi. That’s why the Googles and Yahoos are interested in supporting things like Wikipedia.

Of course I’d rather have perfect and open. And that’s where you guys come in. Your mission is to find ways to expose your data and your materials – to not gate them and put them out for the public to use.

Q: Stumbling block is rights issues? How do you deal with that? Copyrights on things people send in to you, etc?

Dan: I think you can include it in your export of, say, Dublin Core metadata, rights info. You can include that as part of the data. People can steal things anyway.
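One way to picture this: an exported record that carries a Dublin Core rights statement alongside the descriptive fields. The sketch below uses the standard Dublin Core element namespace; the wrapping record element and all of the field values are invented for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a simple XML record with one dc: element per field."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(record, encoding="unicode")


# Invented example values; "rights" travels with the rest of the metadata.
xml_text = dublin_core_record({
    "title": "Untitled story",
    "creator": "Anonymous contributor",
    "date": "2002-09-11",
    "rights": "Copyright held by the contributor; reuse requires permission.",
})
print(xml_text)
```

Whoever harvests the record then gets the rights statement in the same place as the title and date, which is about as much as metadata alone can do.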

Roy: There’s no good answer to that question. This answer isn’t one the George Mason counsel would sanction, but universities tend to worry too much about that stuff. What we’re doing is the essence of what fair use is for – letting scholars do their work. I actually think that a lot of the cases they’re worried about we would win because of fair use. The number one thing that protects what we do is the fact that we can take stuff down if people complain. Compare that to a print publisher where you would have to physically recall books. And I do believe what we’re doing falls under the canons of fair use.

Q: When will historical methodology catch up with the potential of these types of research techniques? It seems like historical methodology has led to projects being scaled down to what will fit in one person’s head. Does this open up bigger opportunities to collaborate among historians and get at bigger issues?

Dan: Yes – we have a large staff now, and we think a lot about collaboration.

We are going to release free software in September – “Scholar for Firefox” – like EndNote, operates right within the browser. Citation manager. Allows distributed scholarship. I work on Victorian politics. Part of this software allows tagging and annotation of documents, which helps share content between researchers working on similar topics. Hoping to get additional funding from IMLS to build an exchange server to facilitate this type of transfer.

Roy: Another angle is one of scale. For example, we suddenly have databases like JSTOR and ProQuest Historical Newspapers. But yet we’re still using the same tools we used to use – basically “reading around” like we used to do in school. There was a brief rise and fall of quantitative social history – but will it make a sort of comeback because people will be able to do this sort of thing? It’s no longer just census data.

One of the things we’re trying to work on is getting the tools to catch up so that non-programmers can do these things.

One of the people working at the center did an illegal hack to get the data he needed from the French national archives. This shouldn’t happen – doctoral students shouldn’t have to hack to get the info they need for their research. So we don’t necessarily have the tools available to do this yet.

Q: Sept. 11 info – you correctly described it as a collection – something that was assembled. You’re describing a universe in which all info is created equal. How does archival appraisal fit into this? How does what you’re talking about affect the role of the archivist?

Dan: I think the example of the teen slang researchers… On the first anniversary, we had 13,000 people write stories on our site. Many were probably school assignments. You may think, well those aren’t worth as much as those of WTC 2 people.

But it’s important to note that you may save it all because you can. But that doesn’t mean you lose the filters. For example, Smithsonian said they didn’t want swear words, they wanted to have “real” stories show up first, etc.

So maybe you should call what we’re doing storage – we’re not making decisions, and people can do that later.

The whole database fits on a hard drive the size of a deck of cards. So no physical limitation. So nothing stops you from creating a curated version. But I still think it’s important that people in the future who have different ideas on how to use things be able to do that.

Roy: We’re in the raw business, not in the cooked business. The idea of layering, which is central to the internet, is a good way to think about this.

Q: Maybe, the concern is that the foundation you’re building is somewhat skewed because you’re driving the collection rather than letting the records collect naturally as part of organizational activities.

Dan: The thing I did in Google Earth. If I were going to do that right, I’d have to normalize for things like amount of data collected in each zip code, the baseline of how many people pray, etc.
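A sketch of the first of those corrections, normalizing by how much was collected in each zip code: divide the number of stories mentioning a theme by the total number of stories from that zip code, and skip places with too few stories to say anything. All counts below are invented, and the cutoff of five stories is an arbitrary assumption. Adjusting for a baseline such as how many people pray would need outside data and is not attempted here.

```python
def rate_per_zip(mentions, totals, min_stories=5):
    """Share of stories in each zip code that mention the theme.

    Zip codes with fewer than min_stories total stories are left out,
    since a rate based on one or two stories is mostly noise.
    """
    rates = {}
    for zip_code, total in totals.items():
        if total < min_stories:
            continue
        rates[zip_code] = mentions.get(zip_code, 0) / total
    return rates


# Invented counts: total stories collected and stories mentioning prayer.
totals = {"10007": 200, "22030": 40, "59901": 2}
mentions = {"10007": 20, "22030": 10, "59901": 2}

rates = rate_per_zip(mentions, totals)
print(rates)  # {'10007': 0.1, '22030': 0.25}
```

In raw counts the first zip code looks like the center of the theme, while as a share of what was collected the second one stands out more, which is the skew the normalization is meant to expose.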

We admit that this was fast. But if you look at blogs, after 3 months, half are abandoned. You have to act now to get the content.

Roy: I think another critical layer there is the interpretive role of the historian. There was a chapter in my dissertation about the playground movement at the turn of the century. Wanted to find out what kids thought about this at the turn of the century. A reporter interviewed kids in the Worcester Telegram and Gazette, and it was a great source. But I had to think about whether the quotes were real, or whether the reporter just made them up. That’s the kind of thing historians do all the time. But the new tools offer some new quantitative ways for looking at things. And transparency is an important part of this.

Q: You’re doing a lot of things that archives aren’t yet doing in any systematic way. Can you talk about how to mobilize digital projects like the September 11 archive?

Roy: We’re pretty much a grant driven operation – and in that way pretty opportunistic. We’re inclined to move quickly and try things out. One of our slogans is “the perfect is the enemy of the good.” For better or worse, we made the decision to have all the people who do things with this on staff (programmers, web designers, etc.). Most of those people have a history background, but that also enables us to move more quickly because we have all these people on staff who do all these aspects.

We felt like we needed to learn the technology and not just turn it over to other people. Unless you’ve worked on the programming, design, etc. it’s hard to think creatively about these things.

Q: Really interested in description process going along with collection? Did people describe things themselves?

Dan: For the 9/11 project, we decided, in our naïve wisdom on this, that we want contributions. If you have a really long form – more than a screen… If you look at web research, people leave if there’s more than one screen.

So we made decisions on what to emphasize. We wanted a location. We wanted a nice big text box – we know that text boxes scroll, but some people don’t. Can we post your name? etc. And then below the submit button we ask about demographics. Research shows that if we put that stuff above the submit button, we would get a lot fewer responses. And remarkably, 2/3 of people did give us a zip code. People are very willing to give that, not so much a phone number.

So we went into this with the idea that if we did too much we would turn people off. So we have names, email addresses, etc. (average email has a 2 year lifespan, so we’ll lose people eventually) but it’s enough to do things with.

Q: Can either of you say anything about the recent federal legislation that has been proposed with regard to community websites?

Dan: Umm, we’re against it? This is like the “Internet is a series of tubes” thing. 13-year-olds know the Internet far better than their parents do, so these things are bound to fail. I think one of the things that is going on right now is that these sites… Where is the most self-documentation going on? Most of it is happening on these big commercial sites like Flickr, MySpace, Yahoo Groups, etc. Yahoo Groups used to have a policy of deleting groups after six months! So you’re always at risk. So places like the Internet Archive that are saving MySpace pages are doing a real service.

Q: For many repositories, use fees are a valuable source of revenue. Others will give some access, but not high res images to maintain control. Your thoughts?

Roy: I’d love to know more concretely about how much user fees actually generate compared to the costs of administering them. Read in the Chronicle this week that the Met has now decided to make limited use of images available for free. Strikes me as a great thing. I want to know what the economics of this are, but it’s a crazy thing that one set of nonprofit academic institutions is charging another to use material. It’s not generating new dollars. My wife is chair of the English department, and a young colleague was asked by an academic publisher to pay a fee to quote in his book from a book published by another university press. This seems to me to defeat the point of what university presses were set up to do. I’m a little skeptical about this.

Q: So you’ve been dodging the preservation issue. How do you propose dealing with it?

Dan: We do have a chapter in the book on this. What can ordinary people do? We provide some practical advice in the last chapter. Using international standards in terms of how things are stored, not wedded to a specific database, etc. In terms of text, images, etc. trying to use nonproprietary standards. Things will improve next year when Open Document Format is released and incorporated into Word, since that’s how many things are produced. Documentation is also really important. We have some practical advice for preparing and making it live longer.

Don’t have so much advice for the 50/100 year problem. One option since hard drives are big is to save things in multiple formats – hopefully one will be readable in Photoshop 2025. We cede some of that to the computer scientists.

Q: We have the great collections from the 19th century because people collected stuff – even if they didn’t know how to arrange or describe it! I want to commend you for doing this and creating a practical problem that archivists will hopefully then have to solve!