Posts

FINAL POINT

     I recently completed my internship at Vanderbilt University, where I worked with an English professor and a librarian on a text-mining project. The goal of the project was to support the professor's research on how law is represented in general dialogue during the Romanticism period (a literary period between 1750 and 1850). So, the data we collected were English books published in the United Kingdom and Ireland between 1770 and 1835. The main source of data was Project Gutenberg, an online library of free eBooks.

     We faced many challenges while working on this project. One of the biggest issues was identifying the correct books to collect. While Project Gutenberg provides some information, such as the author and edition of a book, it does not provide the year of publication, so there was no way for us to know which books were the right ones. We attempted to solve the problem by collecting all the books in English and fr...
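
     To give an idea of what that metadata problem looks like, here is a minimal sketch of querying Project Gutenberg metadata through Gutendex, a third-party REST API for the Gutenberg catalog. This is not necessarily the tool we used, just an illustration: notice that the records expose the author's birth and death years but no publication year, which is exactly the gap described above.

```python
# A sketch of querying Project Gutenberg metadata via Gutendex
# (https://gutendex.com), a third-party API -- illustrative only.
import requests

def search_books(query: str, language: str = "en") -> list[dict]:
    """Return metadata records matching `query` from Gutendex."""
    response = requests.get(
        "https://gutendex.com/books",
        params={"search": query, "languages": language},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["results"]

for book in search_books("romanticism"):
    authors = ", ".join(a["name"] for a in book["authors"])
    # The metadata carries author birth/death years,
    # but no year of publication for the book itself.
    print(book["id"], book["title"], "by", authors)
```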

THE TIME COMPLEXITY

     During my Data Structures class, we learned about time and space complexity. At the scale of the projects I had worked on before, I never really paid attention to time complexity. However, now that I am working with much bigger data and a much more complex project, I have realized how time complexity can become an issue if the code is not optimized. Since I am doing data science tasks, I am using a notebook instead of a regular text editor. With a notebook, I can run a specific cell, which makes it easier to debug. For example, I have a cell that processes the full data, and the next cell explores the result. Once the data is processed, the result is saved in memory, so it does not need to be run again regardless of any change made in any cell below it.

     While I can get away with poorly optimized code by using a notebook, I know that sooner or later I will have to create a program where I will have to use a regular text edit...
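
     A toy example of the kind of thing I mean (not from my actual project): a membership test is O(n) on a Python list but O(1) on average on a set, and at the scale of hundreds of thousands of words the difference becomes very noticeable.

```python
# Toy illustration: membership tests on a list scan every element (O(n)),
# while a set uses hashing (O(1) on average).
import timeit

words_list = [f"word{i}" for i in range(100_000)]
words_set = set(words_list)

# Searching for an item near the end forces the list to scan everything.
list_time = timeit.timeit(lambda: "word99999" in words_list, number=100)
set_time = timeit.timeit(lambda: "word99999" in words_set, number=100)

print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.6f}s")
```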

WHAT I LEARNED FROM MY INTERNSHIP

     This is the final week of my internship, and I am finalizing all my tasks for both projects. As a reminder, I am working on two projects at the same time: data collection and data processing with Project Gutenberg, which I handle alone, and dashboard creation using data from British Periodicals, which I work on with fellow students from Vanderbilt University. Thanks to these two projects, I got the chance to work alone and to work in a team, all in a single internship.

     While working on two projects at the same time may mean more work, it taught me to manage my schedule well, prioritize specific tasks, and complete everything on time. I can say that compared to my first year of college, my time management skills have greatly improved thanks to this internship. Also, I was hoping that this experience would let me find my limit in terms of problem-solving, but that was not the case, as I have always managed to find a solution for any of the issu...

FINAL TASK

     Today, I started what is probably my last task with Project Gutenberg. I am working on named entity recognition (NER). This is my second time applying named entity recognition to this data. For my previous task, I used SpaCy to do the NER, but this time I am using JohnSnowLabs, which provides a collection of pre-trained machine learning pipelines. This time I am using the BERT pipeline to accomplish my task. The final result of my work will be a topic model and a set of named entities, which my supervisor will then analyze.

     I think it is the right time to reflect on some of the things I accomplished during these nine weeks of work. I have learned a lot from this experience: whether it is a way to complete work, a learning habit, or the ability to multitask, I have gained new skills, knowledge, and work experience. I believe that these gains will allow me to perform well in my future internships.
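
     In case it helps anyone, this is roughly what running a JohnSnowLabs (Spark NLP) pre-trained BERT pipeline looks like. It is a minimal sketch, not my actual project code, and the exact pipeline name and output keys can differ depending on which published pipeline you load.

```python
# A minimal sketch of NER with a Spark NLP pre-trained BERT pipeline.
# "recognize_entities_bert" is one of the published English pipelines;
# the pipeline used in the project may have been a different one.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a local Spark session with Spark NLP

pipeline = PretrainedPipeline("recognize_entities_bert", lang="en")

result = pipeline.annotate(
    "Mr. Darcy travelled from London to Pemberley in Derbyshire."
)
# Entity-recognition pipelines expose the merged entity chunks
# under an "entities" key in the annotate() output.
print(result.get("entities"))
```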

PREPARING FOR MY PAPER

     For the past few days, I have been stepping away from my main task in the internship as I prepare my internship paper. I talked with my supervisors yesterday about our plan as we wrap up the internship with one more week to go. While I am a computer science student, the data I am working with consists of literary works by authors from a specific literary period, so my paper will need to include some information about literature. However, this is totally outside of my field, and I do not know anything about it.

     Fortunately, both of my supervisors have a good amount of literary knowledge, and one of them, an English professor, is an expert in the specific era we are working on. So, during our talk yesterday, I asked them if they could assist me in finding some resources and references for the literature part of my paper. Both of them were happy to help. So, we are having a Zoom meeting tomorrow where I will ask any questions I have regarding the perio...

MODULE DOCUMENTATIONS

     I am currently doing named entity recognition on our data. It allows us to collect useful information, such as which character is mentioned the most in a portion of text and where the action is taking place. I am using SpaCy to accomplish this task. SpaCy was recently updated to version 3.x.x, which brought a lot of changes. Since I am working alone all the time, I have to depend on online communities such as Stack Overflow whenever I have an issue. Whenever I have an issue with SpaCy and look it up online, I always have to figure out which version the person is using, then figure out how to translate their solution to the version I am using.

     This process taught me how to read module documentation, and to rely on it rather than just copying and pasting code snippets found online. Reading the documentation gives me the most accurate information about the latest version and a detailed explanation of each piece of syntax. This allows...
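
     For reference, the core spaCy v3 NER workflow is short; this is a simplified sketch, assuming the small English model has already been downloaded with `python -m spacy download en_core_web_sm`.

```python
# Basic NER with spaCy v3: load a pre-trained pipeline, process text,
# and read the entities off the resulting Doc object.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elizabeth walked from Longbourn to Meryton that morning.")

# Each entity carries its text span and a label such as PERSON or GPE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```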

A NEW TEXT PREPROCESSING METHOD

     I was tasked with cleaning the data and keeping only the books that have content of value to us. Since we are focusing on law during the Romanticism period, we only needed books that mention it. The way we cleaned the data was as follows: we take one book, split it into words, and check whether the word "law" appears in it. If the word appears, we collect the 500 words before and after it and only work with that portion of text. We repeat this process with every single one of our 1,400+ books. Since some books do not discuss our topic, they are filtered out during the process. This process gives us clean and very accurate data. A simplified sketch of the idea is shown below.

     After the data cleaning, we feed the portions of text into Gensim for LDA topic modeling. The result so far is satisfactory, but my supervisor still has to review it before we move to the next step.

     This is the first time I have ever used this method of data preprocessing. I have always used all the data I collected...
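
     Here is that sketch: the keyword-window filtering followed by the Gensim LDA step. The function and variable names are illustrative, not our actual project code, and the toy `books` list stands in for the real corpus.

```python
# Keyword-window preprocessing: for each book, keep only a window of
# 500 words on either side of each occurrence of "law"; books where the
# word never appears contribute nothing and are filtered out.
from gensim import corpora
from gensim.models import LdaModel

def law_windows(text: str, keyword: str = "law",
                window: int = 500) -> list[list[str]]:
    """Return word windows around each occurrence of `keyword`."""
    words = text.split()
    windows = []
    for i, word in enumerate(words):
        if word.lower().strip('.,;:!?"\'') == keyword:
            windows.append(words[max(0, i - window): i + window + 1])
    return windows

# `books` stands in for the full corpus of raw book texts.
books = ["... the law of the land ...", "... no relevant content ..."]
documents = [w for book in books for w in law_windows(book)]

# Feed the surviving windows into a Gensim LDA topic model.
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=10, passes=5)
print(lda.print_topics(num_topics=5, num_words=5))
```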