Text-mining is a method that uses tools to see trends
in data sets that are too large to search through conventional means, otherwise
known as Big Data. These tools use computer technology to read vast amounts
of data for whatever input is given, usually a word or phrase, and process the
findings in some presentable way. Different tools present their findings in
different ways, from word clouds to graphs, and each has its uses.
The tools can be divided into two groups: those that focus on searching a
particular document and those that focus on searching a database full of documents.
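To make the idea concrete, here is a minimal sketch of what a database-style tool does at its simplest: take a phrase, read through a collection of texts, and report the hits per year. The documents below are invented purely for illustration, and `count_phrase_by_year` is a hypothetical helper name, not part of any real tool.

```python
from collections import Counter

# Made-up miniature "database": (year, text) pairs invented for illustration only.
DOCUMENTS = [
    (1955, "Divers began underwater archaeology surveys of the harbor."),
    (1955, "The museum report mentions nautical archaeology once."),
    (1975, "Maritime archaeology and underwater archaeology share methods."),
    (1975, "A handbook of underwater archaeology for students."),
]


def count_phrase_by_year(documents, phrase):
    """Count case-insensitive occurrences of a phrase, grouped by year."""
    counts = Counter()
    needle = phrase.lower()
    for year, text in documents:
        counts[year] += text.lower().count(needle)
    return counts


hits = count_phrase_by_year(DOCUMENTS, "underwater archaeology")
for year in sorted(hits):
    print(year, hits[year])
```

A real tool works on millions of scanned pages and normalizes the counts against everything published that year, but the basic input (a phrase) and output (hits over time) are the same.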
The tool I chose to use was Google’s Ngram program. I
wanted to track the nomenclature of the area I have started studying,
underwater archaeology. The field is split among different terms, each with
its own implicit meaning: nautical archaeology and maritime archaeology relate
to ship archaeology, marine archaeology refers to ocean-related archaeology,
and underwater archaeology is an all-encompassing term for all archaeology
done under water. However, ship-focused archaeology does not necessarily
have to be underwater. There has always been a debate over which
term to use when referring to the field, and there is no shortage of
participants in the field championing their own preferred term over the others. I
wanted to apply this nomenclature to Google’s Ngram program and see which term comes
up the most often in the Google Books corpus that Ngram searches.
The parameters for the search were set to the years
1800 to 2000, and the terms used for the search were as follows: maritime archaeology,
nautical archaeology, underwater archaeology, marine archaeology. The resulting
chart was particularly interesting, as it showed underwater archaeology as the
most dominant term within the database, followed by maritime,
marine, and nautical. Nautical archaeology was the earliest term to appear, with
a noticeable uptick in the 1930s before dropping off. The next significant uptick,
across all terms, comes with the advent of the SCUBA unit in the 1950s, as one might
expect, since the SCUBA unit is what allowed the exploration of submerged cultural
remains. The latest term to appear with significant usage is maritime
archaeology, which doesn’t start to gain significant usage until 1974 and gradually
outpaces all other terms save for underwater archaeology. Interestingly enough,
when I extended the search to the latest year Ngram allows, which is
2008, those eight additional years of content caused a dramatic shift in
the findings: underwater archaeology drops off sharply and falls behind maritime
archaeology. In fact, all terms decline in usage from 2000 to 2008, almost
as much as the decline seen in the early 1990s, which serves as a universal low
point in the usage of these terms since the invention of the SCUBA unit.
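For anyone wanting to repeat the two searches, the parameters can be written down as Ngram links. This is only a sketch under assumptions: the base address and the query names `content`, `year_start`, and `year_end` are the ones that show up in the Ngram chart page's address bar, not a documented API, and `build_ngram_url` is a hypothetical helper of my own naming.

```python
from urllib.parse import urlencode

# Assumed address of the Ngram chart page (taken from the browser, not a documented API).
NGRAM_BASE = "https://books.google.com/ngrams/graph"

# The four terms compared in this post.
TERMS = [
    "maritime archaeology",
    "nautical archaeology",
    "underwater archaeology",
    "marine archaeology",
]


def build_ngram_url(terms, year_start, year_end):
    """Hypothetical helper: build a chart link for a comma-separated list of terms."""
    query = {
        "content": ",".join(terms),
        "year_start": year_start,
        "year_end": year_end,
    }
    # urlencode escapes the spaces and commas so the link is safe to paste.
    return NGRAM_BASE + "?" + urlencode(query)


url_2000 = build_ngram_url(TERMS, 1800, 2000)  # the first search
url_2008 = build_ngram_url(TERMS, 1800, 2008)  # the extended search
print(url_2000)
print(url_2008)
```

Keeping the two links side by side makes it easy to flip between the 1800-2000 and 1800-2008 charts and see where underwater archaeology falls behind maritime archaeology.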
Furthermore, this tool allows you to click
on the terms and see the hits that were provided. A few of the searches for the
terms, grouped by time period (such as 1800-1950), provide a list of documents.
When one is selected, a webpage displays a scan of every page in the document
that contains the searched term, which shows not only where
the term appears within the selected text but its context as well. Additionally,
a word cloud is even provided, based on the text, with other similar words and
their frequency, and if you click on one of the words in the word cloud the page
changes its searched term in the text to show scans of the chosen term with its
every appearance and context. This is incredibly helpful, as it helps defeat one
of the criticisms of Big Data, which is its lack of transparency and the lack of a
transition from distant reading to close reading. However, these extra features,
such as the word frequency and word clouds, do not appear for most texts; they
showed up in only a minority of the searches I have done.
Overall, I think using Ngram served my purposes well. I
can easily see which terms were the most popular in relation to each other, and the
tool was simple to use. Ngram’s ability to display word clouds and text-searchable
PDFs with every hit of a term within the text, all on the same page, gives this
program the potential to satisfy a lot of the workflow needs when it comes to Big
Data. The trend in relation to SCUBA diving is clear to see, and if I were to dedicate
more time to exploring this concept, I am sure there is more to be gleaned, such
as the sparse mentions of nautical archaeology before the creation of the SCUBA
unit and the downward trends in the early 1990s and after 2000.