
THE TIME COMPLEXITY

     During my Data Structures class, we learned about time and space complexity. With the scale of the projects I had worked on so far, I never really paid attention to time complexity. However, now that I am working with much bigger data and a much more complex project, I realized how time complexity can be an issue if the code is not optimized. I am doing data science tasks, so I am using a notebook instead of a regular text editor. With a notebook, I can run a specific cell, which makes it easier to debug. For example, I have a cell that processes the full data, then the next cell explores the result. Once the data is processed, the result is saved in memory, so it does not need to be run again regardless of any changes made in the cells below it.

     While I can get away with badly optimized code by using a notebook, I know that sooner or later I will have to create a program where I will have to use a regular text edit...
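
To make the complexity concern concrete, here is a small, hypothetical sketch (not code from the project): checking membership in a Python list scans the whole list every time, while a set lookup is constant time on average, and with a large corpus the difference adds up quickly.

    # Hypothetical example: filtering a large token list against a stopword list.
    stopword_list = ["the", "and", "of", "to", "in"] * 1000      # pretend this is long
    tokens = ["law", "the", "court", "and", "justice"] * 100000  # pretend this is a big corpus

    # O(len(tokens) * len(stopword_list)): every lookup scans the whole list.
    slow = [t for t in tokens if t not in stopword_list]

    # O(len(tokens)) on average: set lookups are constant time.
    stopword_set = set(stopword_list)
    fast = [t for t in tokens if t not in stopword_set]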

WHAT I LEARNED FROM MY INTERNSHIP

     This is the final week of my internship, and I am finalizing all my tasks for both projects. As a reminder, I am working on two projects at the same time: data collection and data processing with Project Gutenberg, which I complete alone, and dashboard creation using data from British Periodicals, which I do with fellow students from Vanderbilt University. Thanks to these two projects, I got the chance to work alone and to work in a team, all in a single internship.

      While doing two projects at the same time means more work, it taught me to manage my schedule well, prioritize specific tasks, and complete everything on time. I can say that, compared to my first year of college, my time management skills have greatly improved thanks to this internship. Also, I was hoping that this experience would allow me to find my limit in terms of problem-solving, but that was not the case, as I have always managed to find a solution for any of the issu...

FINAL TASK

       Today, I started what is probably my last task with Project Gutenberg. I am working on named entity recognition. This is my second time applying named entity recognition to this data. During my previous task, I used SpaCy to do the NER, but this time I am using JohnSnowLabs. JohnSnowLabs is a website offering a collection of pre-trained machine learning pipelines. This time I am using the BERT pipeline to accomplish my task. The final result of my work will be a topic model and a named entity recognition output, which my supervisor will then analyze.

     I think it is the right time to reflect on some of the things I accomplished during these nine weeks of work. I have learned a lot from this experience, whether it is a way to complete work, a learning habit, or the ability to multitask. I have gained new skills, knowledge, and work experience, and I believe that these gains will allow me to perform well in my future internships.
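
For context, a pre-trained JohnSnowLabs (Spark NLP) pipeline can be used in just a few lines. This is a minimal sketch, assuming the spark-nlp package is installed; the pipeline name below is one example from their models hub, not necessarily the one used in this project.

    # Minimal Spark NLP sketch; the pipeline name is an assumption, not the project's actual choice.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()  # starts a Spark session configured for Spark NLP

    pipeline = PretrainedPipeline("onto_recognize_entities_bert_tiny", lang="en")
    result = pipeline.annotate("Elizabeth Bennet travelled from London to Derbyshire.")

    # The available output keys depend on the pipeline; entity-recognition
    # pipelines expose the detected entities under "entities".
    print(result["entities"])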

PREPARING FOR MY PAPER

       For the past few days, I have been stepping away from my main internship task to prepare my internship paper. I talked with my supervisors yesterday about our plan as we wrap up the internship, with one more week to go. While I am a computer science student, the data I am working with consists of literary works by authors from a specific literary period, so my paper will need to include some information about literature. However, this is totally outside of my field, and I do not know anything about it.

     Fortunately, both my supervisors have a good amount of literary knowledge, and one of them, an English professor, is an expert in the specific era we are working on. So, during our talk yesterday, I asked them if they could assist me in finding some resources and references for the literature part of my paper. Both of them were happy to help. We are having a Zoom meeting tomorrow where I will ask any question I have regarding the perio...

MODULE DOCUMENTATION

     I am doing the named entity recognition of our data at the moment. It allows us to collect useful information, such as which character is mentioned the most in a portion of text and where the action is taking place. I am using SpaCy to accomplish this task. SpaCy was recently updated to version 3.x.x, which brought a lot of changes. Since I am working alone all the time, I have to depend on online communities such as StackOverflow whenever I have an issue. Whenever I have an issue with SpaCy and look it up online, I always have to figure out what version the other person is using, then figure out how to convert their code to the version I am using.

     This process taught me how to read module documentation, and to rely more on it rather than just copying and pasting code snippets found online. Reading the documentation gives me the most accurate information about the latest version and a detailed explanation of each piece of syntax. This allows...
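
For reference, the basic spaCy 3.x NER workflow looks roughly like this. It is a minimal sketch with a toy sentence; the small English model is used here as an example and may not be the model used in the project.

    # Minimal spaCy 3.x NER sketch; requires: python -m spacy download en_core_web_sm
    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Mr. Darcy met Elizabeth at Pemberley, in Derbyshire.")

    # Count entity mentions, e.g. to see which character or place appears most often.
    counts = Counter((ent.text, ent.label_) for ent in doc.ents)
    for (text, label), n in counts.most_common():
        print(text, label, n)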

A NEW TEXT PREPROCESSING METHOD

     I was tasked with cleaning the data and keeping only the books that have content of value to us. Since we are focusing on law during the Romanticism period, we only needed books that mention it. The way we cleaned the data was to take one book, split it, and check whether the word "law" appears in it. If the word appears, we collect 500 words before and after it, and only work with that portion of text. We do this with every single one of our 1,400+ books. Since some books do not talk about our topic, they are filtered out during the process. This process allows us to get clean and very accurate data.

     After the data cleaning, we feed the portions of text into Gensim for LDA topic modeling. The result so far is satisfactory, but my supervisor still has to review it before we move to the next step.

     This is the first time I have ever used this method of data preprocessing. I have always used all the data I collected...
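
A rough sketch of that windowing step, as described above; the function name, tokenization, and punctuation handling are illustrative, not the project's actual code.

    # Illustrative sketch of the "500 words before and after" filter.
    def extract_law_windows(text, keyword="law", window=500):
        words = text.split()
        passages = []
        for i, word in enumerate(words):
            if word.lower().strip('.,;:"!?') == keyword:
                start = max(0, i - window)
                end = min(len(words), i + window + 1)
                passages.append(" ".join(words[start:end]))
        return passages

    # Books with no matching passages are simply dropped from the corpus:
    # windows = {title: extract_law_windows(text) for title, text in books.items()}
    # windows = {title: p for title, p in windows.items() if p}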

TEAM WORK

      NLTK is arguably one of the best tools out there for removing stopwords. It is widely used and well supported, so I decided to use it for our project. I realized that while NLTK did its job well, we still ended up with a lot of stopwords that confused our model and ruined the result of our topic modeling. The reason for this is that we are working with two-century-old data. Some words have changed over time, and some words are specific to a particular country or region, like Scotland for example; there are words such as "otter" or "weel". These words are not recognized by NLTK, so they could not be removed. So, one of my supervisors, an English professor who is also the head of the project, had to make a list of all these other stopwords manually and give it to me so that I could add them to the code.

     It is a prime example of how it is always better to have different people with specific expertise on a spec...
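
The mechanics of combining NLTK's stopword list with a manually curated one are simple. A minimal sketch follows; the two extra words are just the examples mentioned above, and the real list from my supervisor was much longer.

    # Minimal sketch: NLTK stopwords plus a hand-made list of period/region-specific words.
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)

    stop_words = set(stopwords.words("english"))
    custom_stopwords = {"otter", "weel"}   # example entries only
    stop_words.update(custom_stopwords)

    tokens = ["the", "weel", "law", "court", "otter"]
    cleaned = [t for t in tokens if t.lower() not in stop_words]
    print(cleaned)   # ['law', 'court']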

NEW KNOWLEDGE

      I started using PySpark for the first time while processing our data. At first, I thought it would be similar to Python, so it would be easy since I already know Python, but I was wrong: there are a lot of differences, and I needed time to learn it. PySpark is not a different language; it is more of a module for the Spark platform, so it is based on Python. I also found out that CSV files differ from one module to another. I created a CSV of our data using Pandas, and when we tried to read it with Spark, the file could not be opened properly, so we had to read the file via Pandas, then convert it from a Pandas DataFrame to a Spark DataFrame.

      The fact that PySpark is different from the tools I am used to working with is a good thing for me, because I managed to learn it and have now gained a new skill. I have always wanted to know at least two ways of completing a task. For example, knowing both TensorFlow and PyTorch, knowing both Spark and Hadoop, or knowing bo...
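
The workaround described above boils down to something like this; a minimal sketch with a made-up file name.

    # Minimal sketch: read the CSV with Pandas, then hand it to Spark.
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gutenberg").getOrCreate()

    pdf = pd.read_csv("books.csv")      # hypothetical file name
    sdf = spark.createDataFrame(pdf)    # convert the Pandas DataFrame to a Spark DataFrame
    sdf.printSchema()

One likely culprit with book-length text fields is embedded newlines: Spark's CSV reader needs its multiLine option to handle those, whereas going through Pandas sidesteps the problem entirely.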

IMPORTANCE OF VARIOUS SKILLSETS

      For the second part of my internship, we are working with Python to process the data, and we are mainly doing topic modeling. This is going to be the very first time that I will be using a tool I learned in college. We will be using Gensim to do topic modeling, and I learned about Gensim during my Text Analysis class right before my internship. I feel happy that some of the skills I learned will be useful in an actual project. However, we will not be using Gensim exclusively; it is only one of several libraries.

     Here again, we will be using more than one tool for the same task in order to get the best result. This is similar to what we did during the data collection. It made me realize how important it is to always compare different options and to know which one to choose. I used to be the type of person who would master a single tool and only do my job with it. For example, I only mastered Python for data science and Pandas fo...
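
For reference, the core Gensim LDA workflow looks roughly like this. It is a minimal sketch with toy, pre-tokenized documents; the number of topics and other parameters are placeholders, not the project's settings.

    # Minimal Gensim LDA sketch with toy documents.
    from gensim import corpora
    from gensim.models import LdaModel

    texts = [
        ["law", "court", "justice", "parliament"],
        ["poem", "nature", "romantic", "verse"],
        ["law", "crime", "trial", "judge"],
    ]

    dictionary = corpora.Dictionary(texts)               # word <-> id mapping
    corpus = [dictionary.doc2bow(doc) for doc in texts]  # bag-of-words vectors

    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                   passes=10, random_state=42)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)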

THE IMPOSTER SYNDROME

     Yesterday, I watched my supervisor upload my data to the system. This officially marks the end of the data collection part of the project, and now we are moving to the data processing part. I am happy and proud of myself for being able to contribute to this project and actually accomplish the task that was assigned to me. For the first time, I felt that I could actually complete the job. Despite the positive comments my supervisors usually give me, I always felt like I had imposter syndrome. I can understand why I felt that way: during the entirety of the first part of the project, I used tools that I had never used before in a real project, yet they relied on me 100% to complete the task since I was the only one working on it. So, I was constantly worried that I would not meet their expectations, that I would spend too much time learning the method and not enough time actually doing the work, or that I would do the task wrong, leading to...

A MOMENT OF REST

     For the past few days, I have been doing nothing. This comes after spending the first six weeks of my internship collecting data. I used that free time to explore some personal projects and to apply the skills I have learned so far. The biggest lesson for me is the value of data in NLP. We spent more than half of my internship period collecting data, just to make sure that we get the best quality possible.

      Most NLP tasks, such as topic modeling, are very sensitive, and a slight change in the input data will deeply affect the outcome of a model. This is the reason why my supervisor, the one who will work directly with the end result of our work, requires that we get the original edition of the books, or the earliest edition available. Books that are centuries old may have been published multiple times, and the text and contents may have been changed to adapt to the period during which each edition was printed. ...

MULTITASKING

      During this research, the team is working with two different sources of data: Project Gutenberg and the British Periodicals. I am the only one working on the Project Gutenberg part, but I also take part in the British Periodicals work. Managing both tasks is quite challenging, but I have still managed to handle them so far. I just finished collecting the data from Project Gutenberg, and I am now switching over to my other task with British Periodicals, where I exclusively use SQL so far. Working on two projects at once is challenging and requires a lot of self-discipline and planning.

      I am not the type of person who multitasks; I prefer to do one thing at a time, so this is a new experience for me. Although I had some struggles in the beginning, I can say that I get better and better as I learn from the mistakes I make. I know that even if I work on both projects at the same time, Project Gutenberg is still my main task, so I always prioritize...

MANUAL VS AUTOMATED

     This week, I focused on storing all the texts that we got from Project Gutenberg in a single CSV file, along with their titles and authors. After completing everything, I reviewed the file and found out that I was missing around ninety books. I had two options to solve this issue: either re-run the entire program from scratch, including the book download, or search for each of the books manually on the Project Gutenberg website and copy the text. The first option seems much better because everything is automated. Also, if I rerun it, it checks whether a book is already downloaded before downloading it, so it only gets the missing books. But there are two reasons why this option is not viable: downloading the books alone takes about 2, and re-running everything does not guarantee success because the reason why the books were not downloaded in the first place was never identified. Therefore, the only option was to search each of the books manually. During the process, I found ou...
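
Figuring out which books were missing came down to a simple comparison, roughly like this. It is a hypothetical sketch; the file names and column names are illustrative, not the project's actual ones.

    # Hypothetical sketch: compare the expected book list with what made it into the CSV.
    import pandas as pd

    expected = pd.read_csv("book_list.csv")   # hypothetical metadata file: id, title, author
    collected = pd.read_csv("books.csv")      # the CSV built from the downloaded texts

    missing = expected[~expected["id"].isin(collected["id"])]
    print(len(missing), "books missing")
    print(missing[["title", "author"]])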