APPLYING DIFFERENT THOUGHTS
Testing out a way to filter the books by year and country of release
I have been having difficulty figuring out the best way to filter for the books released in the UK during a specific time frame. The metadata from Project Gutenberg provides neither the country of origin nor the release date. As described in my last blog post, I have to collect this information from different sources and combine it all. While looking at the metadata again over the weekend, I noticed that it provides the years of birth and death for most of the authors. Of the 17,000+ books in English on Project Gutenberg, around 12,000 come with their author's years of birth and death. While thinking about how to filter the books, I remembered one key piece of advice from one of my supervisors: if I cannot find any information about a particular book, information about its author may be helpful. Even if I do not know when a book was released, I can be certain that an unborn or dead author cannot publish one. So I decided to spend some of my holiday today applying the first filter, which is based on the author.
I removed any book whose author died before the beginning of our time frame and any book whose author was born after the end of it. This required some data manipulation, since I mainly worked from the new metadata I had created earlier. After the filter, my list went down from around 17,000 books to around 8,000. With this shorter list, my next step is to apply the other filters, such as the dates from Wikidata and WorldCat. The method I used today may be useful again in the future: I will most likely use the authors' nationality or country of origin to determine the country of origin of their books when that information cannot be found anywhere else.
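The author-based filter can be sketched roughly as follows. This is only an illustration of the idea, not the exact script I ran: the field names (`author_birth`, `author_death`), the placeholder titles, and the 1800-1900 time frame are all made up for the example, and the `keep_unknown` switch is an assumption about how one might treat books whose author has no recorded years.

```python
from typing import Optional


def author_could_have_published(
    book: dict, start: int, end: int, keep_unknown: bool = True
) -> bool:
    """Return False only when the author's life cannot overlap [start, end]."""
    birth: Optional[int] = book.get("author_birth")
    death: Optional[int] = book.get("author_death")

    # No years at all: the filter cannot decide, so defer to the caller's choice.
    if birth is None and death is None:
        return keep_unknown
    # Author died before the time frame began.
    if death is not None and death < start:
        return False
    # Author was born after the time frame ended.
    if birth is not None and birth > end:
        return False
    return True


# Hypothetical records with placeholder titles and years.
books = [
    {"title": "Book A", "author_birth": 1700, "author_death": 1760},
    {"title": "Book B", "author_birth": 1812, "author_death": 1870},
    {"title": "Book C", "author_birth": 1920, "author_death": 1990},
    {"title": "Book D", "author_birth": None, "author_death": None},
    {"title": "Book E", "author_birth": 1780, "author_death": None},
]

START, END = 1800, 1900  # hypothetical time frame
kept = [b["title"] for b in books if author_could_have_published(b, START, END)]
print(kept)
```

Note that this filter is deliberately loose: it only throws out books that are impossible, so a book that passes still needs a real release date from another source before it can be confirmed as in range.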
While working on this project, I realized that despite the effort the folks at Project Gutenberg have put into it, there is still missing information and plenty of room for improvement. I am not sure whether it was the work of the same group or of a different individual, but the initiative to record the Project Gutenberg book IDs of some books on Wikidata was a bright idea. However, only a very small portion of the books is recorded on Wikidata. Since I already have the metadata for the entire collection, I may continue that work on Wikidata during my free time in the future. By including all the new information that I am scraping right now, I would save time for future researchers and other people who use Project Gutenberg for their projects.
