<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet href="render.xsl" type="text/xsl" ?>
<!DOCTYPE TEI  [
<!ENTITY mdash "&#x2014;">
]>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!--
      $Revision: 1.3 $ last modified $Date: 2009/07/13 06:47:38 $
      by $Author: sigfrid $ 
      $Id: digital_humanities.xml,v 1.3 2009/07/13 06:47:38 sigfrid Exp $
  -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
	<title>Digital Humanities Infrastructures</title>
      </titleStmt>
      <publicationStmt>
	<date>2008-06-12</date>
      </publicationStmt>
      <sourceDesc>
	<bibl/>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front>
      <docTitle>
	<titlePart>Digital Humanities Infrastructures</titlePart>
      </docTitle>
      <docAuthor>
	<name>Sigfrid Lundberg</name>
	<address>
	  <addrLine>slu@kb.dk</addrLine>
	  <addrLine>Digital Development and Production</addrLine>
	  <addrLine>The Royal Library</addrLine>
	  <addrLine>Post box 2149</addrLine>
	  <addrLine>1016 Copenhagen K</addrLine>
	  <addrLine>Denmark</addrLine>
	</address>
      </docAuthor>
      <div type="abstract">
	<p>This note discusses initiatives that aim to provide
	researchers within the humanities with access to resources and
	tools. Since the general breakthrough of search services like
	Google and of annotation and bookmarking services like
	del.icio.us, attitudes towards the application of computing
	within the humanities have changed. In addition, the concept of
	e-science has helped make scholars more positive towards
	applying computing to various problems within the
	humanities.</p>
	<p>Following a discussion that started around the year 2000, we
	propose that many humanities computing tasks can be formalized
	as pipelines of XML processing steps.</p>
      </div>
    </front>
    <body>
      <div>
	<div n="1">
	  <head>Introduction</head>

	  <p><ref target="#flanders01"><name>Flanders</name> (2005) </ref>
	  claimed in a recent article that the situation for text
	  analysis has changed because the cultural position of the
	  computer has changed. Even scholars not involved in computing
	  are computer users and, as such, consumers of huge amounts of
	  net-borne digital data just by using Google.</p>

	  <p>By taking advantage of Google's PageRank when searching,
	  the scholar has become a customer of a huge enterprise
	  employing the latest developments in hypertext, text and
	  language technology.</p>

	  <p>Flanders describes how humanities computing has changed
	  during the last decades: the scholars using these methods no
	  longer believe that their graphs, tables and statistical
	  inferences unravel pure facts. Rather, their view of their
	  own methodologies has been tainted by the development of
	  arts and humanities at large. Having passed through research
	  styles such as hermeneutics, post-modernism, structuralism
	  and post-structuralism, the scholars involved in computing
	  <quote>have to not only seek and value the pattern that our tools can help us to see, but [they] have to be inquisitive about why these patterns seek us out, why we build tool to see them</quote>.
	  (<ref target="#flanders01"><name>Flanders</name>, op. cit.</ref>)</p>

	  <p>Following the changes in computer hardware, software and
	  cultural attitudes of scholars, we have seen a large number
	  of initiatives that aim at establishing collections
	  of tools as well as data for the benefit of
	  researchers within the arts and humanities. These collections
	  are often referred to as <emph> infrastructure</emph>. One
	  such initiative established just recently is the <ref
	  target="#dho01"> <name>Digital Humanities Observatory</name> </ref> 
	  in Ireland. The very name suggests
	  a vantage point from which human heritage and
	  creativity through the ages can be observed, but it also has a more concrete connotation:
	  an observatory is a very physical place, providing access to
	  sophisticated scientific instruments.<note>The DHO was established
	  only recently. The organisation has hired
	  quite a few people, mostly from the US, all with documented
	  expertise in XML text encoding, metadata, software development and
	  digital libraries. This gives an even better idea of
	  what they intend to do than any mission statement.</note>
	  </p>

	  <p>The idea of establishing extensive digital infrastructures
	  aimed at scholars within the humanities goes beyond earlier
	  initiatives. With the advent of e-Science, an infrastructure
	  implies <quote>... methods [that] enable new research by giving researchers access to resources held on widely-dispersed computers as though they were on their own desktops. The resources can include data collections, very large-scale computing resources, scientific instruments and high performance visualisation. </quote> (<ref
	  target="#researchcouncil">Research Councils</ref>)</p>

	</div>

	<div n="2">
	  <head>Digital Infrastructures</head>

	  <p>Traditionally, a digital infrastructure for the
	  humanities (or indeed any subject area) has meant services
	  like the UK-based <title> Intute - the best Web resources for education and research </title> (<ref target="#intute">Intute</ref>) and <title> Arts and Humanities Data Service </title> (<ref target="#jiscahds">AHDS</ref>).
	  The former is a general
	  subject-based information gateway (SBIG), whereas the latter
	  makes it possible for researchers and developers within
	  archaeology, history, literature, language,
	  linguistics and the performing and visual arts to deposit
	  documentation and data for the benefit of future
	  colleagues. Obviously, such infrastructures have also included
	  services like on-line digital object archives, dictionaries and
	  encyclopedias.</p>

	  <p>The support structures that have become available on the
	  World Wide Web following the advent of e-science vary. There
	  is some very interesting research into computer-aided humanities
	  research, such as <title> the nora project </title> (<ref
	  target="#noraproject"><name>Nora</name></ref>), <title> the monk project </title> (<ref
	  target="#monkproject"><name>Monk</name></ref>) and <title> Text Analysis Portal for Research </title> (<ref target="#taportal"><name>TAPoR</name></ref>).
	  The nora project produced (among other things) a web-based
	  interface for text mining and automatic classification (<ref
	  target="#plaisant01">Plaisant et al., 2006</ref>), whereas
	  the monk project (Metadata Offer New Knowledge) seems to concentrate
	  on the interaction between microscopic properties of a text
	  and metadata of the same text. The Nora and Monk projects
	  are cross-disciplinary, involving expertise in text
	  mining, human-computer interaction, information science and
	  various disciplines within the humanities. <ref
	  target="#taportal"><name>TAPoR</name></ref>
	  builds upon a strong vision of the usefulness of text and language technologies in humanities research; users can create word frequency tables and KWIC concordances on the fly from texts they submit.<note>
	    TAPoR is run by a consortium, with nodes
	    at several universities in Canada: <ref target="http://tapor.unb.ca/"> University of New Brunswick</ref>, <ref target="http://tapor.ualberta.ca/"> University of Alberta</ref>, <ref target="http://www.tapor.ca/"> McMaster University</ref>, <ref target="http://tapor.library.utoronto.ca/tapor2_content.html"> University of Toronto </ref> and <ref target="http://tapor.uvic.ca/"> University of Victoria</ref>.
	  </note>
	  See also <ref target="#rockwell01"> Rockwell (2003)</ref>.</p>

	</div>

	<div>
	  <head>What would scholars be doing when using an infrastructure?</head>

	  <p>In order to provide digital research support of the kind
	  envisioned by 
	  <ref target="#researchcouncil"> Research Councils</ref> (op. cit.),
	  one needs a model of which tasks in the day-to-day work of
	  a scholar are tedious, and in particular which of them may
	  be alleviated by computing. We do not foresee computer-aided
	  innovation, inspiration, creativity or essay writing.</p>

	  <p>In general one may
	  (following <ref target="#mccarty"> McCarty, 2002</ref>)
	  divide the work of humanities computing into three fundamental
	  branches:
	  <note>The division is due to <ref target="#mccarty">McCarty (2002)</ref> and references therein; the examples are partly mine.</note></p>

	  <list type="ordered">
	    <item>algorithmic (e.g., calculation of collocation frequencies, or
	    linguistic tagging such as part-of-speech tagging or
	    lemmatization)</item>
	    <item>metalinguistic (e.g., tagging the structure of an
	    object that cannot be ascertained algorithmically)</item>
	    <item>representational (e.g., representing or transforming for
	    viewing or into a KWIC concordance)</item>
	  </list>

	  <p>The three are in my view not really on an equal footing:
	  representations are generally a product of whatever algorithmic or
	  metalinguistic markup is available. But the opposite is also
	  true: analysis of collocation frequencies can obviously be
	  made in greater detail if there is good metalinguistic
	  markup.</p>
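
	  <p>As a rough illustration of the algorithmic and
	  representational branches, the following Python sketch counts
	  collocates and builds a KWIC concordance. The naive tokenizer,
	  the function names and the default window width are
	  assumptions made for this example; they are not taken from
	  McCarty or from any existing tool.</p>

	  <eg xml:space="preserve"><![CDATA[
```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased word tokens (deliberately naive)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


def collocates(tokens, keyword, width=3):
    """Count the tokens found within `width` positions of every hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            counts.update(tokens[max(0, i - width):i])
            counts.update(tokens[i + 1:i + 1 + width])
    return counts
```
]]></eg>

	  <p>Here <code>collocates</code> stands for the algorithmic
	  branch and <code>kwic</code> for the representational one. With
	  metalinguistic markup available, the token list could be
	  restricted to, say, the speeches of a single character before
	  either function is applied.</p>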

	  <p>The starting point for
	  <ref target="#mccarty"> McCarty's (op. cit.) </ref> arguments is a short paper by <ref target="#scholarlyprimitives"> Unsworth (2000)</ref>,
	  who
	  investigates the idea that scholarly work can be sub-divided
	  into a finite list of primitive operations, or
	  <quote>self-understood</quote> functions. According to his
	  original list, a scholar might be engaged in</p>

	  <list type="bulleted">
	    <item>annotating</item>
	    <item>discovering</item>
	    <item>comparing</item>
	    <item>illustrating</item>
	    <item>referring</item>
	    <item>representing</item>
	    <item>sampling</item>
	  </list>

	  <p>resources.</p>

	  <p>However, Unsworth seems to think that this list can be reduced to a very brief one, <quote rend="space">referring</quote>  and <quote>representing</quote>.
	  Be that as it may.
	  The list of functions will never be complete, even if an object-oriented programmer would almost certainly agree with Unsworth.
	  The functions on the list should be composable, in the sense that one should be able to pipe them together like basic Unix tools.
	  That is, the output of one should be usable as input to the next.</p>
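
	  <p>The requirement that the output of one function can serve
	  as input to the next can be sketched in a few lines of
	  Python. The three functions below are hypothetical stand-ins
	  that merely borrow the names of some of Unsworth's primitives,
	  and the search term and the list-of-lines data model are
	  assumptions made for this example.</p>

	  <eg xml:space="preserve"><![CDATA[
```python
from functools import reduce


def pipe(*steps):
    """Compose steps left to right: each output is the next step's input."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


# Hypothetical stand-ins for three primitives, all mapping a list of
# text lines to a new list of text lines.
def discovering(lines):
    """Keep only the lines that mention the (assumed) search term."""
    return [line for line in lines if "Holberg" in line]


def sampling(lines):
    """Keep every second line of the input."""
    return lines[::2]


def representing(lines):
    """Render the lines as a numbered listing."""
    return [f"{n}. {line}" for n, line in enumerate(lines, start=1)]


workflow = pipe(discovering, sampling, representing)
```
]]></eg>

	  <p>Because every step accepts and returns the same kind of
	  value, steps can be reordered, dropped or added without
	  changing the others. This is the property that makes Unix
	  pipes useful, and the one we would want from the scholarly
	  primitives.</p>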

	  <p>McCarty's humanities computing research infrastructure (or
	  rather <quote>software</quote>; he never mentions
	  infrastructure) should likewise support some list of primitive functions, but
	  they should be operating within <quote>[...]  a singular world-wide entity&mdash;a heterogeneous, geographically unconstrained working environment of mutually compatible data and software to which independent, otherwise uncoordinated efforts contribute</quote>. He <emph> does realize </emph> that the resources within this entity
	  should be modular and be <quote>in more or less standard format</quote>.</p>

	  <p>An example may put some flesh on the bones. Assume that Jens is
	  preparing a paper on Ludvig Holberg's shaping of male and female
	  characters.<note>This is a completely fictional example.</note> The idea is that Holberg, being a male early 18th-century
	  playwright, uses contemporary stereotypes of class
	  and gender to shape his characters, and that these stereotypes are
	  manifested in his language. In order to be understood, these
	  stereotypes had to be easily recognized by a contemporary audience
	  (who obviously should get a good laugh, or they wouldn't pay the
	  entrance fee).</p>

	  <list type="ordered">

	    <item>Jens creates an account at www.clarin.dk.</item>

	    <item>After configuring his workspace, he starts by
	    <emph rend="bold"> discovering </emph> a set of Holberg comedies.</item>

	    <item>He issues a query such that the database returns a document
	    <emph rend="bold"> referencing </emph> the names of the characters
	    within the cast
	    lists (<emph rend="italics">dramatis personae</emph>) of Holberg's complete works.</item>

	    <item>He then characterises the persons according to gender (male
	    and female) and class (lower, middle and upper) by <emph
	    rend="bold"> annotating </emph> the search result.</item>

	    <item>Having done this, he continues by <emph
	    rend="bold"> annotating </emph> the plays such that they
	    reference the complete annotated dramatis personae list.</item>

	    <item>The next step is the <emph rend="bold"> sampling </emph> of his
	    annotated version of Holberg's works, such that he gets six new
	    documents, one for each class/gender category, each containing speech
	    only.</item>

	    <item>Finally, he is <emph rend='bold'> comparing </emph> these texts
	    with statistical and other methods, using tools made available at
	    the www.clarin.dk web site.</item>
	  </list>

	  <p>My guess is that Jens would fail with this design. He would almost
	  certainly find that the characters' language is more affected by whether
	  they are speaking in the presence of people of lower or higher
	  rank (or both). This information is available in the stage
	  directions of each scene.</p>

	  <p>Anyway, this might be regarded as positivist rantings from a
	  natural scientist. Humanists in Denmark may or may not like this
	  way of working, but this is how I understand that people work within
	  digital humanities, and in particular how I understand Unsworth's and
	  McCarty's writings.</p>

	  <p>There are a few noteworthy things in this story. First (and
	  foremost), each step is standard XML processing: we have a
	  collection of XML documents, and whatever Jens does leads
	  to a new XML document. Secondly, the result of each step is piped more or
	  less directly into the next step. That is, each document, in the
	  original collection as well as among the derived ones, must be
	  <quote rend="space">in more or less standard format</quote> (McCarty, op. cit.);
	  that is, it must conform to one of a few accepted schemas.</p>
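
	  <p>To suggest what one such XML-in, XML-out step might look
	  like, here is a minimal Python sketch of the sampling step in
	  Jens's workflow. It assumes, hypothetically, that the
	  annotation step has put <code>gender</code> and
	  <code>class</code> attributes on each <code>role</code>
	  element of the cast list, and that each speech
	  (<code>sp</code>) points to its speaker through
	  <code>who</code>. Neither the attribute names nor the function
	  names come from an existing clarin.dk service.</p>

	  <eg xml:space="preserve"><![CDATA[
```python
import copy
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical annotation vocabulary, as assigned by Jens in the
# annotating step.
GENDERS = ("male", "female")
CLASSES = ("lower", "middle", "upper")


def sample_speeches(play, gender, social_class):
    """Return a new TEI <div> with the speeches of one gender/class category."""
    wanted = {
        "#" + role.get(XML_ID)
        for role in play.iter(TEI + "role")
        if role.get(XML_ID) is not None
        and role.get("gender") == gender
        and role.get("class") == social_class
    }
    sample = ET.Element(TEI + "div", {"type": "sample"})
    for speech in play.iter(TEI + "sp"):
        if speech.get("who") in wanted:
            # Copy the speech so that the input document is left untouched.
            sample.append(copy.deepcopy(speech))
    return sample


def sample_all(play):
    """Return the six derived documents, keyed by (gender, class)."""
    return {(g, c): sample_speeches(play, g, c) for g in GENDERS for c in CLASSES}
```
]]></eg>

	  <p>The input is a parsed XML document and each result is again
	  an XML element tree, so the output can be handed directly to
	  the comparing step. In the architecture proposed here, a step
	  like this would more likely be an XSLT or XQuery transformation
	  than Python code.</p>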

	</div>

	<div>
	  <figure xml:id="figure01">
	    <graphic width="12cm" url="clarin_paper.png"/>
	    <head>Figure 1. Possible overall architecture of the clarin.dk
	    site, with the back-end aggregator systems shown. For the portal
	    architecture, please refer to Figure 2.</head>
	  </figure>

	</div>

	<div>

	  <head>Building a service</head>

	  <p>Merging the ideas quoted and presented in the previous sections,
	  we realize that the humanities computing infrastructure is
	  more or less the same as the World Wide Web. In many respects,
	  what is envisioned in the literature is close to what Tim
	  O'Reilly is evangelising in the set of papers and talks
	  linked from his web site <title> Peer-to-Peer Networking, Web Services, and the Emergent Internet Operating System </title> (<ref target="#oreilly">O'Reilly</ref>).

	  He does not mention humanities resources, but his vision of the tools is similar.
	  Since many of these tools exist already, we will not have to start from scratch.</p>

	  <p>The Danish Clarin network will consist of member organisations
	  whose staff are contributors to, as well as users of, the collections
	  maintained within the network. For practical reasons, some point within the network must be assigned the role of central hub (that is, users need somewhere to find information about the services provided).

	  A central index also helps to
	  provide fast and efficient resource discovery. The network must have
	  facilities for the dissemination of metadata. The hub will also be
	  the home of user-contributed data and possibly also of mirrors of the
	  other repositories (<ref target="#figure01">Figure 1</ref>).</p>

	  <p>There will be a need for a set of protocols for metadata
	  harvesting as well as for distributed searching. At this
	  time, the most promising candidates are <title> Protocol for Metadata Harvesting </title> (<ref target="#oaipmh">Open Archives Initiative</ref>) and <title> Object Reuse and Exchange </title> (<ref target="#oaiore">Open Archives Initiative</ref>) for harvesting, and <title> OpenSearch/1.1/Draft 3 </title> (<ref target="#opensearch">OpenSearch</ref>) for distributed searching. Mirroring, if done at all, will be carried out over HTTP, based on the data exposed via OAI-PMH by member sites.
	  Members without database-driven access to their data will have to provide it by other means,
	  for example manual file upload.</p>
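
	  <p>An OAI-PMH request is an ordinary HTTP request in which the
	  operation is named by a <code>verb</code> parameter. The
	  following Python sketch only builds the request URLs a
	  harvester at the hub might issue. The function names are
	  invented for this example, and any base URL passed in (such as
	  <code>http://repository.example.org/oai</code>) is a
	  placeholder, not an existing member repository.</p>

	  <eg xml:space="preserve"><![CDATA[
```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc",
                         from_date=None, set_spec=None):
    """Build the first ListRecords request of a harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed since this date.
        params["from"] = from_date
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def oai_resume_url(base_url, resumption_token):
    """Build a follow-up request for the next chunk of a long result list.

    In OAI-PMH the resumptionToken is an exclusive argument, so it is
    sent together with the verb only.
    """
    params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    return base_url + "?" + urlencode(params)
```
]]></eg>

	  <p>The <code>from_date</code> argument is what would make
	  incremental mirroring practical: the hub asks each member site
	  only for the records changed since the previous harvest.</p>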

	  <figure xml:id="figure02">
	    <graphic width="12cm" url="clarin_portal.png"/>
	    <head>Figure 2. Possible architecture of the clarin.dk
	    portal.</head>
	  </figure>

	  <p>The portal's architecture is depicted in
	  <ref target="#figure02"> Figure 2</ref>. The main function in the stack of operations discussed above is <emph rend="bold"> discovering</emph>.

	  For authenticated users, we foresee that all transactions
	  between the user and the system will be in a standardised XML format
	  rather than HTML. In this vision, the rendition of documents
	  is performed in the user's web client. The user will thus be
	  able to process any document delivered with his or her
	  favourite native XML tools.

	  We propose that the portal be built as an engine based on <title> TEI Lite </title> (<ref target="#teilite">Burnard &amp; Sperberg-McQueen, 2006</ref>). It is possible to produce very sophisticated hypertext documents in this format for direct client-side rendition (the current document is an example of this).</p>

	  <p>Given a standard, modular and extensible method to
	  implement and document XML processing pipelines, we are
	  confident that the pipeline stack of Figure 2 can be
	  implemented. Several systems suitable for the
	  purpose are available; the most promising is <title> XProc, An XML Pipeline Language </title> (<ref target="#xproc">Walsh et al., 2008</ref>).

	  An XProc pipeline accepts zero or more XML documents as input and produces zero or more XML documents
	  as output.

	  A processor should support XSLT 1.0 and 2.0 for transformation,
	  and XPath 2.0 and XQuery 1.0 for querying.
	  XQuery may be used for transformation as well as querying.
	  With this arsenal, we envisage that the whole set of Unsworth's functions can be implemented without programming, in the sense that staff will not have to write extensive amounts of (say) Java code to implement the individual pipes.</p>

	</div>

      </div>
    </body>
    <back>
      <div type="bibliogr">
	<listBibl>
	  <head>References</head>

	  <bibl xml:id="oaipmh">
	    <author>Open Archives Initiative</author>
	    <title level="m">Protocol for Metadata Harvesting</title>
	    <ref target="http://www.openarchives.org/pmh/">http://www.openarchives.org/pmh/</ref>
	  </bibl>

	  <bibl xml:id="oaiore">
	    <author>Open Archives Initiative</author>
	    <title level="m">Object Reuse and Exchange</title>
	    <ref target="http://www.openarchives.org/ore/">http://www.openarchives.org/ore/</ref>
	  </bibl>

	  <bibl xml:id="opensearch">
	    <author>Open Search</author>
	    <title level="m">OpenSearch/1.1/Draft 3</title>
	    <ref target="http://www.opensearch.org/Specifications/OpenSearch/1.1">http://www.opensearch.org/Specifications/OpenSearch/1.1</ref>
	  </bibl>

	  <bibl xml:id="xproc">
	    <editor>Walsh, Norman</editor>
	    <editor>Alex Milowski</editor>
	    <editor>Henry S. Thompson</editor>
	    <date>2008</date>
	    <title level="m">XProc: An XML Pipeline Language</title>
	    <ref target="http://www.w3.org/TR/xproc/">http://www.w3.org/TR/xproc/</ref>
	  </bibl>

	  <bibl xml:id="teilite">
	    <author>Burnard, Lou</author>
	    <author>C. M. Sperberg-McQueen</author>
	    <date>2006</date>
	    <title level="m">TEI Lite: Encoding for Interchange: an introduction to the TEI Revised for TEI P5 release</title>
	    <ref target="http://www.tei-c.org/release/doc/tei-p5-exemplars/html/teilite.doc.html">http://www.tei-c.org/release/doc/tei-p5-exemplars/html/teilite.doc.html</ref>
	  </bibl>
	  
	  <bibl xml:id="flanders01">
	    <author>Flanders, Julia</author>
	    <title level="a">Detailism, Digital Texts, and the
	    Problem of Pedantry.</title>
	    <title level="j">TEXT Technology</title>
	    <date>2005</date>
	    <biblScope type="volume">14</biblScope>
	    <biblScope type="number">2</biblScope>
	    <biblScope type="pp">41-70</biblScope>
	    <ref target="http://texttechnology.mcmaster.ca/pdf/vol14_2/flanders14-2.pdf">http://texttechnology.mcmaster.ca/pdf/vol14_2/flanders14-2.pdf</ref>
	  </bibl>

	  <bibl xml:id="dho01">
	    <title level="m">Digital Humanities Observatory</title>
	    <ref target="http://www.dho.ie/">http://www.dho.ie/</ref>
	  </bibl>

	  <bibl xml:id="plaisant01">
	    <author>Plaisant, Catherine</author>
	    <author>James Rose</author>
	    <author>Bei Yu</author>
	    <author>Loretta Auvil</author>
	    <author>Matthew G. Kirschenbaum</author>
	    <author>Martha Nell Smith</author>
	    <author>Tanya Clement</author>
	    <author>Greg Lord</author>
	    <date>2006</date>
	    <title level="a">Exploring Erotics in Emily Dickinson's
	    Correspondence with Text Mining and Visual
	    Interfaces.</title>
	    <title level="j">Proceedings of the 6th ACM/IEEE-CS
	    joint conference on Digital libraries (JCDL'06)</title>
	    <ref target="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.3281">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.3281</ref>
	  </bibl>

	  <bibl xml:id="mccarty">
	    <author>McCarty, Willard</author>
	    <title level="a">Humanities Computing: Essential Problems, Experimental Practice.</title>
	    <title level="j">Literary and Linguistic Computing</title>
	    <date>2002</date>
	    <biblScope type="volume">17</biblScope>
	    <biblScope type="number">1</biblScope>
	    <biblScope type="pp">103-125</biblScope>
	    <ref target="http://llc.oxfordjournals.org/cgi/content/abstract/17/1/103">http://llc.oxfordjournals.org/cgi/content/abstract/17/1/103</ref>
	  </bibl>

	  <bibl xml:id="oreilly">
	    <author>O'Reilly, Timothy</author>
	    <title
	    level="m">Peer-to-Peer Networking, Web Services, and the
	    Emergent Internet Operating System</title>
	    <ref target="http://tim.oreilly.com/p2p/index.csp">http://tim.oreilly.com/p2p/index.csp</ref>
	  </bibl>

	  <bibl xml:id="scholarlyprimitives">
	    <author>Unsworth, John</author>
	    <date>2000</date>
	    <title level="a">Scholarly Primitives: what methods do
	    humanities researchers have in common, and how might our
	    tools reflect this?</title>
	    <title level="j">Part of a symposium on "Humanities Computing: formal methods,
	    experimental practice" sponsored by King's College,
	    London, May 13, 2000</title>
	    <ref target="http://www.iath.virginia.edu/~jmu2m/Kings.5-00/primitives.html">http://www.iath.virginia.edu/~jmu2m/Kings.5-00/primitives.html</ref>
	  </bibl>

	  <bibl xml:id="rockwell01">
	    <author>Rockwell, Geoffrey</author>
	    <title level="a">What is Text Analysis, Really?</title>
	    <date>2003</date>
	    <title level="j">Literary and Linguistic Computing</title>
	    <biblScope type="volume">18</biblScope>
	    <biblScope type="number">2</biblScope>
	    <biblScope type="pp">209-219</biblScope>
	    <ref target="http://llc.oxfordjournals.org/cgi/content/abstract/18/2/209">http://llc.oxfordjournals.org/cgi/content/abstract/18/2/209</ref>
	  </bibl>

	  <bibl xml:id="researchcouncil">
	    <author>Research Councils [the UK]</author>
	    <title level="m">What is e-Science?</title>
	    <ref target="http://www.rcuk.ac.uk/escience/default.htm#phMain">http://www.rcuk.ac.uk/escience/default.htm#phMain</ref>
	  </bibl>
	  <bibl xml:id="intute">
	    <title level="m">Intute - the best Web resources for education and research</title>
	    <ref target="http://www.intute.ac.uk/">http://www.intute.ac.uk/</ref>
	  </bibl>

	  <bibl xml:id="jiscahds">
	    <title level="m">Arts and Humanities Data Service</title>
	    <ref target="http://ahds.ac.uk/">http://ahds.ac.uk/</ref>
	  </bibl>

	  <bibl xml:id="noraproject">
	    <title level="m">Nora project</title>
	    <ref target="http://www.noraproject.org/">http://www.noraproject.org/</ref>
	  </bibl>

	  <bibl xml:id="monkproject">
	    <title level="m">Monk project</title>
	    <ref target="http://www.monkproject.org/">http://www.monkproject.org/</ref>
	  </bibl>

	  <bibl xml:id="taportal">
	    <title level="m">TAPoR - Text Analysis Portal for
	    Research</title>
	    <ref target="http://portal.tapor.ca/">http://portal.tapor.ca/</ref>
	  </bibl>

	</listBibl>
      </div>
    </back>
  </text>
</TEI>

