Tuesday, September 17, 2019

First attempt at Text-Mining. What is Big Data?


Text-mining is a method that uses software tools to find trends in data sets too large to search through by conventional means, otherwise known as Big Data. These tools read vast amounts of data for a given input, usually a word or phrase, and process the findings into some presentable form. Different tools present their findings in different ways, from word clouds to graphs, and each has its uses. The tools can be divided into two groups: those that search within a particular document and those that search a database full of documents.
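To make the idea concrete, here is a minimal Python sketch of what such a tool does at its simplest: count how often a phrase appears in the texts from each year, relative to the number of words from that year. The function name phrase_frequency_by_year and the three-sentence toy corpus are made up for illustration, and this is not a description of how any particular tool is actually implemented.

```python
from collections import defaultdict


def phrase_frequency_by_year(documents, phrase):
    """Return {year: occurrences of phrase per word of text from that year}.

    documents is an iterable of (year, text) pairs. Matching is a simple
    case-insensitive substring count, which is a simplification.
    """
    hits = defaultdict(int)
    words = defaultdict(int)
    needle = phrase.lower()
    for year, text in documents:
        lowered = text.lower()
        hits[year] += lowered.count(needle)
        words[year] += len(lowered.split())
    return {year: hits[year] / words[year] for year in sorted(words) if words[year]}


# Hypothetical toy corpus, invented purely for this example.
toy_corpus = [
    (1955, "Underwater archaeology began to grow once divers could stay below."),
    (1955, "The survey recorded a wreck near the harbour."),
    (1975, "Maritime archaeology and underwater archaeology overlap in practice."),
]

freq = phrase_frequency_by_year(toy_corpus, "underwater archaeology")
for year, value in freq.items():
    print(year, round(value, 4))
```

Dividing by the number of words per year matters because far more text survives from recent decades than from earlier ones, so raw counts alone would make almost any term look like it is on the rise.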

The tool I chose to use was Google’s Ngram program. I wanted to track the nomenclature of the area I have started studying, underwater archaeology. The field is split amongst different terms, each with its own implicit meaning: nautical archaeology and maritime archaeology relate to ship archaeology, marine archaeology refers to ocean-related archaeology, and underwater archaeology is an all-encompassing term for all archaeology done under water. A ship-focused archaeological term, however, does not necessarily have to involve work under water. There has always been a debate over which term to use when referring to the field, and there is no shortage of participants in the field who champion their own preferred term over the others. I wanted to apply this nomenclature to Google’s Ngram program and see which term comes up most often in the Google Books corpus that Ngram searches.
The search covers the years 1800 to 2000, and the terms used are as follows: maritime archaeology, nautical archaeology, underwater archaeology, marine archaeology. The resulting chart was particularly interesting, as it showed underwater archaeology to be the most dominant term in the database, followed by maritime, marine, and nautical. Nautical archaeology was the earliest term to appear, with a particular uptick in the 1930s before dropping off. The next significant uptick, for all terms, comes with the advent of the SCUBA unit in the 1950s, as one might expect, since the SCUBA unit is what allowed the exploration of submerged cultural remains. The latest term to appear is maritime archaeology, which does not gain significant usage until 1974 and gradually outpaces all other terms save for underwater archaeology. Interestingly enough, the picture changed when I extended the search to the latest year Ngram allows, which is 2008. These eight additional years of content caused a dramatic shift in the findings: underwater archaeology drops off drastically and falls behind maritime archaeology. In fact, all terms decline in usage from 2000 to 2008, almost as much as the decline seen in the early 1990s, which serves as a universal low point in the usage of these terms since the invention of the SCUBA unit.

https://books.google.com/ngrams/graph?content=maritime+archaeology%2C+nautical+archaeology%2C+underwater+archaeology%2C+marine+archaeology&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cmaritime%20archaeology%3B%2Cc0%3B.t1%3B%2Cnautical%20archaeology%3B%2Cc0%3B.t1%3B%2Cunderwater%20archaeology%3B%2Cc0%3B.t1%3B%2Cmarine%20archaeology%3B%2Cc0#t1%3B%2Cmaritime%20archaeology%3B%2Cc1%3B.t1%3B%2Cnautical%20archaeology%3B%2Cc1%3B.t1%3B%2Cunderwater%20archaeology%3B%2Cc1%3B.t1%3B%2Cmarine%20archaeology%3B%2Cc1
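For anyone who wants to repeat or vary the search, the first part of the link above can be put together with a few lines of Python. This is only a sketch: the parameter names (content, year_start, year_end, corpus, smoothing) and the values 15 and 3 are copied straight from the link, while the function name build_ngram_url is my own, and I am assuming the viewer still accepts links built this way.

```python
from urllib.parse import urlencode

# Base address taken from the link above.
BASE_URL = "https://books.google.com/ngrams/graph"

TERMS = [
    "maritime archaeology",
    "nautical archaeology",
    "underwater archaeology",
    "marine archaeology",
]


def build_ngram_url(terms, year_start=1800, year_end=2008, corpus=15, smoothing=3):
    """Build an Ngram graph link; defaults mirror the values in the link above."""
    params = {
        "content": ", ".join(terms),  # terms are separated by commas
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,
    }
    # urlencode turns spaces into "+" and commas into "%2C".
    return BASE_URL + "?" + urlencode(params)


url_to_2000 = build_ngram_url(TERMS, year_end=2000)  # the first search
url_to_2008 = build_ngram_url(TERMS, year_end=2008)  # the extended search
print(url_to_2000)
print(url_to_2008)
```

Changing only year_end makes it easy to compare the two searches described above side by side.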

Furthermore, this tool allows you to click on the terms and see the hits behind the chart. Some of the searches for a term within a time range (such as 1800-1950) provide a list of documents, and when one is selected, a webpage displays a scan of every page in the document on which the searched term appears. This shows not only where the term appears within the selected text, but its context as well. Additionally, for some texts a word cloud of other common words and their frequency is provided, and clicking on one of those words changes the searched term so that the page shows scans of every appearance of the chosen word and its context. This is incredibly helpful, as it helps answer one of the criticisms of Big Data: its lack of transparency and the lack of a transition from distant reading to close reading. However, extra features such as the word frequencies and word clouds do not appear for most texts; they were available in only a minority of the searches I have done.

Overall, I think Ngram served my purposes well. I can easily see which terms were the most popular in relation to each other, and the tool was simple to use. Ngram's ability to display word clouds and text-searchable PDFs with every hit of a term within a text, all on the same page, gives this program the potential to satisfy a lot of the workflow needs when it comes to Big Data. The trend in relation to SCUBA diving is clear to see, and if I were to dedicate more time to exploring this concept, I am sure there is more to be gleaned, such as the sparse mentions of nautical archaeology before the creation of the SCUBA unit and the downward trends in the early 1990s and after 2000.
