
Digital Humanities

Here you will find a guide to Digital Humanities resources.

What is XPath?

  • XPath stands for "XML Path Language"
  • It is a query language used to search for data within an XML document
  • It is a standard way to find specific XML elements
  • If you imagine the XML document as a tree, you can use XPath to navigate the tree and get to the leaves you need
  • The expressions are called paths - imagine you are telling the computer the path you want to take to get to the elements you are searching for

XML Summary

<div type="dedicatoryEpistle">

     <p> Hi </p>

</div>

  • XML (eXtensible Markup Language) is a way to mark up or annotate text that can easily be expanded with new tags (hence "extensible")
  • Elements in XML are usually nested inside one another
  • An element is a start tag, an end tag, and everything between them, like this: <tag> I am an element </tag>
  • A start tag is the element's name surrounded by < > and an end tag is the same name surrounded by </ >

  • We can use XPath to find the elements we want based on the relationship elements have with each other and the data associated with them

  • In the example above, div is an element node
  • type is an attribute node: it belongs to div and holds the value "dedicatoryEpistle"
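
As a small sketch of the difference between element nodes and attribute nodes, the snippet below reads the example above with Python's built-in xml.etree.ElementTree module. The choice of Python is ours for illustration; the guide does not depend on any particular tool.

```python
import xml.etree.ElementTree as ET

# The XML Summary example, written as a Python string.
div = ET.fromstring('<div type="dedicatoryEpistle"><p> Hi </p></div>')

print(div.tag)                       # div  (the element node's name)
print(div.attrib)                    # {'type': 'dedicatoryEpistle'}  (its attributes)
print(div.get("type"))               # dedicatoryEpistle
print([child.tag for child in div])  # ['p']  (elements nested inside div)
```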

Simple XPath

<root>

     <div type="dedicatoryEpistle">

          <salute> Hi </salute>

          <p> Hi </p>

     </div>

     <p> Hello </p>

</root>

  • Start expressions with / or // followed by a tag name - you can chain these steps together

  • Starting the expression with / means it will find elements starting from the root - the root is simply the element that contains every other element (it can be called anything)
    • If we search for salute with /salute, we won't find it because salute is not at the top of the document
    • We need to use /root/div/salute to find it - this gives its exact location
    • This is called the absolute path (the exact location relative to the root)

  • Starting with // means it will locate elements no matter where they are in the document - better for searching
    • //salute will find salute no matter where it is
    • //p will get both the <p> inside and the <p> outside the div
    • Many tutorials call this a relative path because the match does not depend on the element's exact location; strictly speaking, // still starts at the root and searches every level below it
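
The two kinds of search can be tried out with a short Python sketch. It uses the standard xml.etree.ElementTree module, which is our assumption for illustration and supports only part of XPath: paths are evaluated relative to an element, so /root/div/salute is written as ./div/salute starting from <root>, and //salute is written as .//salute.

```python
import xml.etree.ElementTree as ET

XML = """
<root>
    <div type="dedicatoryEpistle">
        <salute> Hi </salute>
        <p> Hi </p>
    </div>
    <p> Hello </p>
</root>
"""

# ET.fromstring returns the outermost element, <root>.
root = ET.fromstring(XML)

# Exact location: step from <root> to <div> to <salute>.
exact = root.findall("./div/salute")

# A single step only looks at direct children, so this finds nothing:
# salute is nested inside div, not directly inside root.
direct = root.findall("./salute")

# Search at any depth below <root>.
anywhere = root.findall(".//salute")
all_p = root.findall(".//p")

print([e.text.strip() for e in exact])     # ['Hi']
print(len(direct))                         # 0
print([e.text.strip() for e in anywhere])  # ['Hi']
print([e.text.strip() for e in all_p])     # ['Hi', 'Hello']
```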

Relationships

  • Elements can be related to each other in many ways when using XML
  • An element can be a:
    • Parent - The element that directly contains the current element (1 level up)
    • Ancestor - Any element that contains the current element: its parent, its parent's parent, and so on (1 or more levels up)
    • Child - Any element that the current element directly contains (1 level within)
    • Sibling - Any element that is contained by the same parent as the current element (next to it in the tree)
    • Descendant - Any element contained within the current element: its children, their children, and so on (1 or more levels within)
 

Relationship Examples:

Parent:

<root>

     <div type="dedicatoryEpistle">

          <salute> Hi </salute>

          <p> Hi </p>

     </div>

     <p> Hello </p>

</root>

  • div is the parent of salute
  • It is one level above salute

Ancestor:

<root>

     <div type="dedicatoryEpistle">

          <salute> Hi </salute>

          <p> Hi </p>

     </div>

     <p> Hello </p>

</root>

  • root and div are both ancestors of salute
  • They are one or more levels above salute: div is its parent, and root contains that parent
  • Ancestors can be thought of as the parent + its parent and so on

 

Child:

<root>

     <div type="dedicatoryEpistle">

          <salute> Hi </salute>

          <p> Hi </p>

     </div>

     <p> Hello </p>

</root>

  • div is a child of the root (and so is the <p> that contains Hello)
  • Both are one level within the root

Sibling:

<root>

     <div type="dedicatoryEpistle">

          <salute> Hi </salute>

          <p> Hi </p>

     </div>

     <p> Hello </p>

</root>

  • div and the <p> that contains Hello are siblings because they are next to each other
  • They also share the same parent, the root

 

Descendant:

<root>

     <div type="dedicatoryEpistle">

          <salute> Hi </salute>

          <p> Hi </p>

     </div>

     <p> Hello </p>

</root>

  • div, both <p> elements, and salute are all descendants of the root
  • salute and the <p> inside div are children of a child (div), so they are two levels within the root
  • Descendants can be thought of as children + their children and so on
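
The five relationships can also be checked in code. The sketch below is a rough illustration using Python's standard xml.etree.ElementTree module (our assumption; any XML tool would do). ElementTree elements do not remember their parent, so the sketch first builds a child-to-parent lookup; the names parent_of and ancestors are our own helpers, not part of the library.

```python
import xml.etree.ElementTree as ET

XML = """
<root>
    <div type="dedicatoryEpistle">
        <salute> Hi </salute>
        <p> Hi </p>
    </div>
    <p> Hello </p>
</root>
"""

root = ET.fromstring(XML)
salute = root.find("./div/salute")

# Helper lookup (ours): map every element to the element that contains it.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(element):
    """Walk upward: the parent, then its parent, and so on."""
    found = []
    while element in parent_of:
        element = parent_of[element]
        found.append(element)
    return found

parent = parent_of[salute]
ancestor_tags = [a.tag for a in ancestors(salute)]
child_tags = [c.tag for c in root]
sibling_tags = [s.tag for s in parent if s is not salute]
descendant_tags = [d.tag for d in root.iter() if d is not root]

print(parent.tag)        # div
print(ancestor_tags)     # ['div', 'root']
print(child_tags)        # ['div', 'p']
print(sibling_tags)      # ['p']
print(descendant_tags)   # ['div', 'salute', 'p', 'p']
```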

Axes

  • We can use the relationships elements have with each other to tell XPath exactly what elements we're looking at
  • These ways of specifying a relationship are called XPath axes
  • Each relationship has one or more axes associated with it

 

  • We write axes like this - axis::target-elements
    • axis is the type of relationship we want to look at
    • target-elements are the elements we want to select which have that relationship
    • Use "::" in between them
    • They go inside a normal path expression, in place of a plain tag name - for example, //salute/parent::div

  • parent - parent (".." is shorthand for the parent of the current element)
  • child - child (this is the default axis, so child::p can be written as just p)
  • ancestor
    • ancestor - ignore the current element
    • ancestor-or-self - include it
  • descendant
    • descendant - ignore the current element
    • descendant-or-self - include it
  • sibling
    • preceding-sibling - only the siblings before an element
    • following-sibling - only the siblings after an element
  • There are a few more (such as self, attribute, preceding, and following), but they are not part of the main relationships, and some have more convenient shorthand: "." for self and "@" for attribute
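
The shorthand forms can be tried with Python's standard xml.etree.ElementTree module, sketched below as an illustration only. This module is our choice, and it supports only part of XPath: it accepts the abbreviations (".." for parent, a plain tag name for child) but not the full axis::target-elements notation, so axes such as ancestor or following-sibling need a complete XPath tool such as an XML editor.

```python
import xml.etree.ElementTree as ET

XML = """
<root>
    <div type="dedicatoryEpistle">
        <salute> Hi </salute>
        <p> Hi </p>
    </div>
    <p> Hello </p>
</root>
"""

root = ET.fromstring(XML)

# parent axis, abbreviated "..": the parent of every salute.
parents = root.findall(".//salute/..")

# child axis (the default): a plain tag name selects children,
# so ./div/p is the <p> directly inside <div>, not the one outside it.
children = root.findall("./div/p")

print([e.tag for e in parents])            # ['div']
print([e.text.strip() for e in children])  # ['Hi']
```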

Axes Examples

  • //persName/ancestor::p
    • This will search the whole document for <persName> tags and then select all the ancestors that are <p> tags
  • //p/child::note
    • This will find all the <p> tags and then select only the <note> children
    • You don't need to write child:: here - child is the default axis, so //p/note selects the same elements (//note on its own would select every <note>, including any that are not directly inside a <p>)
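
To see what the child step selects, here is a small sketch using Python's standard xml.etree.ElementTree module (our assumption for illustration). The sample document is made up for this example. ElementTree does not accept the child:: spelling, so the step is written in its short form, p/note, and the result is compared with a search for every note.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one note inside a <p>, one note inside a <head>.
SAMPLE = """
<body>
    <head>Title <note>editorial note</note></head>
    <p>First paragraph <note>margin note</note></p>
    <p>Second paragraph</p>
</body>
"""

body = ET.fromstring(SAMPLE)

# Short form of //p/child::note: notes that are children of a <p>.
in_paragraphs = body.findall(".//p/note")

# Every note in the document, whatever contains it.
everywhere = body.findall(".//note")

print([n.text for n in in_paragraphs])  # ['margin note']
print([n.text for n in everywhere])     # ['editorial note', 'margin note']
```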

Specificity

  • Many of these expressions give you multiple results (a list of results)
  • How can we find the specific one we are looking for if we need only one?
  • We can target elements with certain properties by using [ ] to the right of a tag name
  • In the square brackets, we can give a position number (counting starts at 1, not 0) to pick one element from the list if we know which one we want
    • /TEI[1]/text[1]/body[1]/div[1]/p[18]/persName[5]
    • This is the absolute location of one <persName> tag: it is inside the 1st <TEI>, the 1st <text>, the 1st <body>, the 1st <div>, and the 18th <p>, and it is the 5th <persName> there
  • We can also specify further which elements we want by picking certain attribute values if we don't know the exact numbers

  • We specify that we are talking about an attribute in the square brackets [ ] by using the @ symbol

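    • //p[18]/persName[@type="hist"]
    • This will find each <p> that is the 18th <p> inside its parent, then pick all the <persName> tags inside it that have a "type" attribute with the value "hist"
    • In other words, all the historical figures in an 18th paragraph (if the document has several <div>s, each one can contribute its own 18th <p>)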
  • The square brackets can be chained to further refine the result

    • //p[18]/persName[@type="hist"][1]
    • This will take the list above and pick the first <persName> in it
    • So, the first historical figure in that paragraph
  • You can do this with any attribute and value
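
The position and attribute filters can be sketched in Python with the standard xml.etree.ElementTree module (our assumption for illustration). The sample text and names are made up, and p[2] stands in for p[18] to keep the sample short. ElementTree's limited XPath may not treat a chained [1] the way full XPath does, so the sketch picks the first result in Python instead.

```python
import xml.etree.ElementTree as ET

# Made-up sample with two paragraphs and a mix of name types.
SAMPLE = """
<div>
    <p>Opening paragraph with <persName type="fict">Hamlet</persName>.</p>
    <p>
        <persName type="fict">Ophelia</persName> reads of
        <persName type="hist">Julius Caesar</persName> and
        <persName type="hist">Cleopatra</persName>.
    </p>
</div>
"""

div = ET.fromstring(SAMPLE)

# Position filter: every persName in the 2nd <p>.
second_p_names = div.findall(".//p[2]/persName")

# Attribute filter: only those whose type attribute is "hist".
hist = div.findall(".//p[2]/persName[@type='hist']")

# Stand-in for the chained [1]: take the first item of the filtered list.
first_hist = hist[0]

print([n.text for n in second_p_names])  # ['Ophelia', 'Julius Caesar', 'Cleopatra']
print([n.text for n in hist])            # ['Julius Caesar', 'Cleopatra']
print(first_hist.text)                   # Julius Caesar
```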

Try it Yourself

  • Those are all the main techniques needed to use XPath
  • We recommend using these techniques to design better queries to get the data you need for research!

Resources to Learn More XPath! - Informal Bibliography

  1. W3Schools XPath guide - a good overview of everything: https://www.w3schools.com/xml/xpath_intro.asp
  2. Axes guide - describes the relationships and axes we talked about: https://youtu.be/aAWvwGFkySI
  3. Mozilla XPath axes - list of all the axes and what they do by the developers of Firefox: https://developer.mozilla.org/en-US/docs/Web/XPath/Axes
  4. Mozilla XPath functions guide - we didn't talk about this, but lists functions which can be used in the square brackets: https://developer.mozilla.org/en-US/docs/Web/XPath/Functions

MLA Bibliography

CitizenK, Cuimingda, Dria, ExE-Boss, Fredchat, Jonnyq, mete0r, Mgjbot, Nickolay, Ptak82, Sheppy, & SphinxKnight. (n.d.). Axes. Axes - XPath | MDN. Retrieved December 13, 2021, from https://developer.mozilla.org/en-US/docs/Web/XPath/Axes.

CitizenK, Cuimingda, Dria, ExE-Boss, Fredchat, Jonnyq, mete0r, Mgjbot, Nickolay, Ptak82, Sheppy, & SphinxKnight. (n.d.). Functions. Functions - XPath | MDN. Retrieved December 13, 2021, from https://developer.mozilla.org/en-US/docs/Web/XPath/Functions.

W3Schools. (n.d.). XPath Tutorial. Retrieved December 13, 2021, from https://www.w3schools.com/xml/xpath_intro.asp.

"XPath Axes - ancestor, parent, following-sibling, preceding-sibling, child, descendant." YouTube, uploaded by H Y R Tutorials, December 13, 2021, https://youtu.be/aAWvwGFkySI

Contributors

Fall 2021 Honors Digital Humanities Class: Nick Ribeiro and Ellie Lynch

Spring 2022 Library Digital Humanities Intern: Meeghan Bresnahan