DATA PROCESSING

     Today, for the first time, I worked with data that I would consider fairly big: the collection of all Project Gutenberg books, 31GB in total. My task is to process the books and build a metadata collection for Project Gutenberg, including each book's full text, title, and author. When I wrote the code, I wrote it in full and debugged on the go whenever an issue occurred. This is how I usually work, but it was not the best method this time. Because of the amount of data, a single run takes several hours to complete, so running everything and stopping midway to debug was very time-consuming. Sometimes my script ran for hours before I realized something was wrong, and I had to stop it, edit the code, and run it from the beginning again.
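     For reference, here is a minimal sketch of the per-book extraction step, assuming each book is a plain-text file whose header carries "Title:" and "Author:" lines, as Project Gutenberg plain-text files usually do. The function name and file layout are illustrative, not my actual script.

    import re
    from pathlib import Path

    def extract_metadata(path):
        """Pull the title, author, and full text from one Gutenberg plain-text file."""
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        # Gutenberg headers contain lines such as "Title: Moby Dick"
        # and "Author: Herman Melville".
        title = re.search(r"^Title:\s*(.+)$", text, re.MULTILINE)
        author = re.search(r"^Author:\s*(.+)$", text, re.MULTILINE)
        return {
            "title": title.group(1).strip() if title else None,
            "author": author.group(1).strip() if author else None,
            "text": text,
        }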

    It took me about a week to complete the task, and the final version of the code is still running as I write this post. The experience made me realize that with data of this size, it is much more efficient to first establish a ground truth with a small sample: run the code on a small portion of the data that the script can finish within a short period. This makes debugging much easier and much faster. The idea is that once everything works on the sample, we simply swap the small data for the full dataset, and the code should behave the same. I learned this the hard way, and I will make sure not to make this mistake again.
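    Below is a minimal sketch of that workflow, assuming the books sit in a directory of .txt files and reusing the hypothetical extract_metadata helper from the sketch above. The directory name and sample size are made up; the point is that one flag switches between the quick sample run and the full run.

    from pathlib import Path

    BOOKS_DIR = Path("gutenberg_books")  # hypothetical corpus location
    SAMPLE_SIZE = 50                     # small enough to finish in minutes

    def run(sample=True):
        paths = sorted(BOOKS_DIR.glob("*.txt"))
        # Ground-truth pass: restrict the run to a handful of books so the
        # script finishes quickly and bugs surface in minutes, not hours.
        if sample:
            paths = paths[:SAMPLE_SIZE]
        return [extract_metadata(p) for p in paths]  # helper sketched above

    # Once run(sample=True) behaves correctly, run(sample=False) processes
    # the full 31GB collection with no other change to the code.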
