DSpace

Originally uploaded by D-.

After spending lunch at a roundtable targeted at student members of SAA, I headed over to the International Ballroom for a session on DSpace, an open-source digital repository system written in Java. The system is now in use at a large number of institutions nationwide (and even worldwide).

Two of the panelists were from MIT, which originally designed the system in conjunction with Hewlett-Packard. The third was from the Kansas State Historical Society, which is using DSpace to store an archive of digital reports submitted to the state legislature. (See KSpace)

The system has the benefit of being free, open source software, but it also has a reputation for requiring a great deal of IT support to maintain. The MIT folks emphasized that the policy decisions are much harder than making the IT work. But then again, we don’t all have the resources of MIT. Veatch (from Kansas) said that having an IT person leave had made it difficult to make progress on the project because of the skill level needed to do customization and modification.

Overall, this was a useful introduction to the system. I was aware of DSpace prior to this, but the session gave me a better understanding of how it’s being used in the real world. I’d like to try getting a copy up and running on a server in the Tech Lab back at GSLIS so that we can play around with it a bit. Hmm, maybe a project for this fall…

As before, see below for another impressionistic transcript…


DSpace and its Implementations: An Impressionistic Transcript:

MacKenzie Smith, MIT Media Laboratories.
Tom Rosko, Institute Archivist, MIT
Matt Veatch, State Archivist, Kansas State Historical Society

General intro to DSpace: open source app (since 2002)
“Institutional repository platform” – “terrible term, because no one understands what it means.”
Voluntary federation of digital repositories run by many academic research institutions (more than 150 now). This has some benefits – for example, Google Scholar picks up all objects in registered DSpace Repositories. Development of the system is complicated because different sites have very different ideas on what it should do.

Originally designed to be optimal for papers and publications, but the same technology can be useful for a variety of other digital media – although it falls apart a bit for super complex digital objects.

Communities, sub communities, collections, and items. (Hierarchy of objects.)
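To make that hierarchy a little more concrete, here is a minimal sketch of how the four levels nest. It is only an illustration: the class names, fields, and the `item_count` helper are my own assumptions and are not DSpace's actual (Java) data model or API, and the sample names in `build_example` are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single deposited object; it may hold one file or many."""
    title: str
    files: List[str] = field(default_factory=list)


@dataclass
class Collection:
    """A group of items inside a community."""
    name: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Community:
    """A community holds collections and, optionally, sub-communities."""
    name: str
    subcommunities: List["Community"] = field(default_factory=list)
    collections: List[Collection] = field(default_factory=list)

    def item_count(self) -> int:
        """Count items in this community and everything nested below it."""
        own = sum(len(c.items) for c in self.collections)
        nested = sum(s.item_count() for s in self.subcommunities)
        return own + nested


def build_example() -> Community:
    """Build a tiny, made-up hierarchy (all names are hypothetical)."""
    theses = Collection("Theses", [Item("Sample thesis", ["thesis.pdf"])])
    reports = Collection("Reports", [Item("Sample report", ["report.pdf"])])
    department = Community("Example Department", collections=[theses])
    return Community(
        "Example University",
        subcommunities=[department],
        collections=[reports],
    )
```

Calling `build_example().item_count()` walks the whole tree, which is roughly the kind of roll-up you would want when asking how much content a given community is responsible for.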

Open Source – free, but TCO is not zero! Need hardware, IT experts, people to make changes, upgrades, etc. Smith believes it’s still slightly cheaper than a commercial vendor because vendors are too expensive. But the real benefit is being able to adapt if you have the skills available. (Commercial vendors are now starting to offer hosting and support for DSpace, though.)

Governance structure – loosely based on apache.org. A non-profit foundation is now being established to deal with intellectual property issues and other problems.

The platform is developing very rapidly, with rapid evolution and new releases every 6-9 months.

“Within 10 years, every research university will have a digital repository based in the library or archives.”

– MIT now has 10,000 e-theses online. These are the biggest single type of content in repositories now, because everyone has them and no one knows what to do with them.
– Some places also using it for publishing. DPubS – an e-journal publishing system that can be laid over DSpace or Fedora.
– Research data sets – like XML snippets representing molecules.
– Learning objects – developed for use in online courseware systems – should be preserved. (For example, MIT’s OpenCourseWare project.)

Most sites look similar, but you can customize the look and feel. Also some international adoption – for example, a large library in Brazil. (Biblioteca ___)

U of R has developed an entirely new front-end for the system. Duke now using it for student portfolios! You get certain things with it, but if you have the resources you can expand it to do whatever you want.

Big topic of contention re: policies: what is a community in DSpace? How do you define it, and what responsibilities does the community take on? Decisions about content, access, who inputs metadata? What are the repository’s responsibilities to the community? What are the library’s rights (what if a community disappears?) Content guidelines? If determined by the community, raises issues relative to appraisal and collection development. So some places do appraisal centrally.

MIT offers guidelines of types of content, but will accept anything.

Need to determine access policies. MIT DSpace requires all content to at least be available to the MIT community.

Metadata policies. MIT assumes that submitters will provide metadata, and they usually do. But who does quality control and resolves conflicts? You have to look for solutions other than hand-crafted item-level data designed by an archivist, since there is such a volume of content. User-supplied metadata? (The Wikipedia approach to metadata.)

Can providers remove items? MIT says no – they will hide something, but it still remains. BUT – after 30 years the administrators can remove content if they feel it’s no longer worthy of preservation.

Preservation policies: for high risk formats, they don’t agree to do anything more than preserving the bits. Whereas, for common formats like PDF/A they’re pretty confident they can offer preservation. DSpace can’t figure this out for you – “preservation is a human activity.”

Physical stewardship: you might describe something even though all of the content isn’t physically in the system – for example, linked on a faculty web page. So you have to be careful.

IP issues – very complicated – some donors may not have copyright, but have the right to give it to you. Have to make sure you have enough rights to do something useful. On the flipside, have to tell the public what they can do with content.

Libraries, archives, other institutions are merging in terms of the technologies they need, so it’s not smart for each discipline to continue doing everything the way they always have. Doesn’t make sense to have “the archives platform” and “the library platform” where the underlying platforms are 80% the same.

TOM ROSKO:

DSpace@MIT – research and teaching output of MIT. Doesn’t necessarily fit with what’s in the archives. Was set up originally to promote the idea of faculty ownership rather than library initiative. But this is difficult, because archives sees so many opportunities for what it could be.

Less stuff is coming into the archives in paper format. So they are going to go out and start proactively gathering content.

Other _Spaces (on the drawing board):

Because DSpace is “faculty owned,” doesn’t entirely serve the needs of the Archives. Hence, new ideas:

XSpace: (Digital Library)
– Digital library content (non-MIT created)
– Digitized Collections
ASpace (Digital Archives)
– Administrative records
– Manuscript collections (faculty papers)
– MIT “non-research” publications
– Other MIT-related materials
o News office photos, video, websites

But… this presents issues – there is overlap with DSpace (faculty research, for example). Also raises issues for archival needs, such as retention schedules or access restrictions – these may not currently be fully supported in the DSpace software. Can be hard to map collection structure of records. (DSpace can theoretically do some of these things, but MIT has not implemented them.)

General DSpace issues at MIT:

“Technical stuff is the easy stuff!” It’s the policies and procedures that are hard.

– multiple stakeholders
o MIT libraries, MIT, researchers
– Multi-level decision making process
– Collection Development
o Who is liaising with content providers?
– Communication – unified message of what _Space is and isn’t
– Administrative and faculty policies
o Access policies
– Intellectual Property
o Ownership of material and rights
– DSpace Policy
o Do items fit DSpace criteria?
o What community “owns” them?
– Withdrawals/Transfers of “ownership”
– Procedural/workflow issues: complicated in the paper world, and still complicated in digital world. How does, say, a thesis make its way through the organization and end up in the archives?

A DSpace object can be one file, or a thousand-file website. Makes it harder to make policy.

Matt Veatch
State Archivist, Kansas State Historical Society

“KSpace” – an implementation in a state government environment. Catalyst was a 2002 law authorizing electronic submission of reports to the Kansas legislature.
KSHS and State Library of Kansas have statutory responsibility to collect and preserve state government publications. They had preservation concerns when the law was being considered, and were told to “solve the problem.”

Solution: a collaborative project between KSHS, State Library, and all three branches of government.

Initial goal: create digital repository to preserve and provide access to ____

Decision to proceed with DSpace driven by functionality, budgetary constraints, comfort with open source software (“culture is open source based” – this presentation is currently running on OpenOffice! We like the flexibility and adaptability of such products), and technical staff. (We had at least one person on staff with the confidence to say we could do this. Tom and Mac say that the policy issues are hard, tech is easy. That’s true to an extent, but there can be significant technical hurdles. But we felt like we could do it.)

Initial proof of concept was running on a workstation under the web programmer’s desk – if he kicked it wrong, it went down. Showed it to a variety of people in government – successful. So planned a pilot implementation, but needed funding for this. Applied for a grant from the Information Network of Kansas (Kansas.gov web portal). Run by a private entity, and they make a profit on things like licenses. So they are required to reinvest in projects. We asked for $50k, and spent about $40k – turned some back. Money was mainly for hardware – bought servers, firewalls, backup services, a bit of marketing. But didn’t spend a ton of money.

Generally followed DSpace planning and implementation guidance based on experiences elsewhere. (project team, policy advisory group, technical advisory group.)

Not a lot of customization, but submission screens and graphic layout changed. Mainly to simplify for busy state employees – they’re not going to provide a ton of metadata. Worked with targeted agencies on content submission to develop training and memoranda of understanding.

Key decisions:
– Accept a limited number of file formats
o Mitigate future preservation issues
o Based on FL Center for Library Automation
o Three support levels – preferred (XML, PDF/A, etc.), acceptable (PDF, etc.), unsupported (Microsoft Word, Microsoft PowerPoint, Microsoft Anything!)
– Failover system to maintain access
o Duplicate server, firewall at offsite location
– Contract for backup services.
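The three support levels above amount to a simple lookup from format to tier. Here is a rough sketch of that idea. The format labels come from the talk, but everything else is my own assumption: the names `SupportLevel`, `FORMAT_LEVELS`, and `support_level` are hypothetical, and treating any format not in the table as unsupported is a guess at the policy, not something Veatch stated. This is not how KSpace or DSpace actually implements it.

```python
from enum import Enum


class SupportLevel(Enum):
    PREFERRED = "preferred"
    ACCEPTABLE = "acceptable"
    UNSUPPORTED = "unsupported"


# Only the formats named in the session; the real list is longer ("etc.").
FORMAT_LEVELS = {
    "xml": SupportLevel.PREFERRED,
    "pdf/a": SupportLevel.PREFERRED,
    "pdf": SupportLevel.ACCEPTABLE,
    "microsoft word": SupportLevel.UNSUPPORTED,
    "microsoft powerpoint": SupportLevel.UNSUPPORTED,
}


def support_level(format_name: str) -> SupportLevel:
    """Return the support tier for a format label.

    Assumption: formats missing from the table fall back to UNSUPPORTED.
    """
    key = format_name.strip().lower()
    return FORMAT_LEVELS.get(key, SupportLevel.UNSUPPORTED)
```

Note that the lookup is keyed on a format label rather than a file extension, since PDF and PDF/A files both end in `.pdf` and telling them apart requires actual format identification.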

Pilot launched in late 2004 – narrowed scope to focus on reports submitted to the Kansas legislature. Manual process of getting them all in in 2005 – way too manual and not scalable, but got them all in. Now working on 2006 reports.

In our definition of terms, “State Agency” becomes a DSpace community. Then you might subdivide beyond that – a sub-sub-community might be “office of the secretary.”

Lessons learned:

– Technical staff is key – we lost the programmer who did implementation, and that has made it very difficult. It is written in server-side Java, and Java skills are required to do customization. So despite what Tom said, having some technical staff available is important.
– MOU process is time consuming.
– Manual submission by agencies is not feasible – automated harvest tools needed. Unlike faculty, state employees don’t have the motive to preserve their work for posterity.

Went to workshop in Baltimore, used tool developed by LoC/Center for Technology in Government to assess digital preservation capability. Need to look at long term planning issues like scope, resources needed for maintenance and expansion, risk assessment, stakeholder analysis, evaluation criteria, automated content acquisition.

Phase 2: Capture publications on agency websites (definition of “publication” interpreted broadly)

Working on automated content acquisition
– looking at “Web Archives Workbench,” a tool being developed on a grant by OCLC to help capture web content. (Session on Saturday morning with Judy Kopp (sp?).) Looking at how this can be integrated with a digital repository.

Anticipate doing some pilot tests of non-web records within the next 12 months. Will require more systematic approach, but we think system can handle it. “Not a true electronic records system, but it’s what we’ve got.”

Long term maintenance and expansion – we want KSpace recognized as an “enterprise application” in the state government, which would ensure funding stream from other agencies.

www.kspace.org

Q: (From UCLA Archivist) How many faculty at MIT? A: 900ish

Q: (From UCLA Archivist) Authenticity of records?
A: Veatch – that’s one of the reasons we haven’t done records yet. How do you make sure something doesn’t change over time? If we’re going to use it for state records, we need ingest procedures, with good metadata so that we can be convinced that we can preserve authenticity.

MIT also has issues with authenticity because of the faculty control. At MIT, students can submit their own thesis, in addition to the official copy (?), which raises questions.

Q: University of WI at Madison guy – going to be putting time schedules of classes, etc. in repository because no longer being printed on paper. Are you planning on doing those types of large pubs?
A: Course catalog produced by a unit of president’s office. Half owned by registrar’s office.

U. of Wash. is using it to preserve LDAP directory.

Q: University of CA Irvine: KS is restricting formats, MIT is not. How does MIT deal with delivery of all those formats?
A: MacKenzie – we don’t deal with it. Content can be viewed in browser using whatever method the browser has defined for that kind of content. Must have software to view it on their own machine. (I.e., if they send a Word doc, then people have to have Word to read it.) Later we’ll probably look at providing alternate formats, but DSpace separates storage of content from rendering.

Q: Grant?
A: MIT’s grant from HP ran out four years ago. Now a broad based open source project. Trying to build something that’s sustainable without grant money, unlike a lot of other projects out there. We’ve been off that crutch for four years, and things are working fine. The basic platform is supported by the community of users.