ONLINE RESOURCE VS OWN SCRIPT

     On Wednesday, I was tasked with creating a dataset of the entire Project Gutenberg corpus, containing each book's full text, author, and title. This dataset will be stored online; we will draw our portion of the data from it, and it may also be useful to other researchers in the future. I found a project on GitHub that attempted to collect and process the Project Gutenberg books, but the code is broken in several ways: it collects only half of the data, it processes only a very small portion of what it collects, and its per-language filter does not work. As a result, the program is practically useless, and I have to collect the data with my own script. Because the data is large, and to avoid overloading Project Gutenberg's servers, there is a small delay after each request. This means it takes a few days to collect all 68,000 books. I am now on my third day of downloading and have only 46,000 books so far.
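To make the "small delay after each request" concrete, here is a minimal sketch of such a download loop. It is not the script I actually ran: the names `collect_books`, `fetch`, and `estimated_days`, as well as the 2-second default delay, are assumptions made for illustration. The `fetch` argument stands in for whatever function really downloads one book.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def collect_books(
    book_ids: Iterable[int],
    fetch: Callable[[int], Optional[str]],
    delay_seconds: float = 2.0,  # assumed value, not the delay actually used
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[int, str]:
    """Fetch each book in turn, pausing after every request."""
    collected: Dict[int, str] = {}
    for book_id in book_ids:
        # `fetch` is a hypothetical downloader that returns None on failure.
        text = fetch(book_id)
        if text is not None:
            collected[book_id] = text
        # Pause after each request so the server is not flooded.
        sleep(delay_seconds)
    return collected


def estimated_days(num_books: int, seconds_per_book: float) -> float:
    """Rough total duration in days at a fixed cost per book."""
    return num_books * seconds_per_book / 86_400  # seconds in one day


if __name__ == "__main__":
    # Stand-in downloader so the sketch runs without any network access.
    def fake_fetch(book_id: int) -> Optional[str]:
        return None if book_id == 2 else f"Full text of book {book_id}"

    books = collect_books([1, 2, 3], fake_fetch, delay_seconds=0.1)
    print(sorted(books))
    print(round(estimated_days(68_000, 2.0), 2), "days at 2 s per book")
```

Passing `sleep` as a parameter is only a convenience: it lets the loop be tested without actually waiting. The time per book in practice is the delay plus the download itself, so a real run takes longer than the delay alone suggests.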

    This made me realize that although there are many resources online, they are not always reliable, and that instead of relying on such a resource as is, it is better to understand how it works and reproduce it yourself. I also found out that debugging someone else's code is much harder than writing your own program, especially if the code is not commented. I have always been told to comment my code, but since I always know what I am doing, I never felt the need to. This project made me realize how important comments are, not for me, but for anyone who will use my program in the future.
