PROGRESS IN DATA COLLECTION

 The data collection process

    Today, I had my first meeting with the other fellow students and professors. During the meeting, I had a chance to chat with a professor about the work I am doing and the stage I am currently at. The professor told me that he is interested in the same data and has not managed to collect it so far. I got to share with him some of the methods I have used, my results so far, and the new method I am planning to try. This is now my third straight blog post about the same step: collecting books published in the U.K. during a specific period of time. This is a very tricky part of the project, as there is no good way to collect the data. Project Gutenberg has the books, but not their country or year of publication. So I selected only the books written in English, and I have to rely on other sources to figure out when and where each of them was published.

    After exploring the options I could find, I am now planning to try WorldCat. WorldCat has an API, but it is paid and is not easily accessible to an individual. So, instead of using the API, I will scrape the website with a Python script that drives a browser through Selenium WebDriver. This method is not guaranteed to work, as I expect to be denied access to the website after a certain number of requests. I hope to avoid this by putting a short pause between requests, or a longer one between batches of requests. These pauses should also reduce the impact such a large query has on the WorldCat server.
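    To make the pausing idea concrete, here is a rough sketch of how it could look in Python. It is not my actual script. The function name fetch_with_pauses, its parameters, and the default pause lengths are all made up for illustration, and the fetch argument stands in for whatever function ends up loading a WorldCat page.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_with_pauses(
    titles: Iterable[str],
    fetch: Callable[[str], str],
    short_pause: float = 2.0,    # seconds between requests (assumed value)
    batch_size: int = 50,        # requests per batch (assumed value)
    long_pause: float = 60.0,    # seconds between batches (assumed value)
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Call fetch(title) for every title, pausing after each request.

    A short pause follows most requests; a longer pause follows every
    batch_size-th request.
    """
    results: Dict[str, str] = {}
    for count, title in enumerate(titles, start=1):
        results[title] = fetch(title)
        # Longer rest at the end of each full batch, short rest otherwise.
        sleep(long_pause if count % batch_size == 0 else short_pause)
    return results
```

    With Selenium, fetch could be a small function that calls driver.get(...) on a search URL for the title and returns driver.page_source. Passing sleep as a parameter makes it possible to try the pausing logic without actually waiting.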

    Chatting with that professor today made me realize that other people are looking for the same data that we are, and yet it is not easily accessible. So, I am planning to collect the year and country of publication for all the English-language books on Project Gutenberg and store the data somewhere online for people to use.
