SCRAPING WORLDCAT
The final step for getting the books' publication dates
Yesterday, I started writing the script to scrape WorldCat using Python and Selenium. This task is particularly tough and time-consuming: I have spent more than 15 hours on it and it is still not finished. The goal is to get the date on which each of the more than 17,000 books was published. My method is to pass the title of a book to the search bar, go through the results, find the earliest edition, and then extract its publication date. The tricky part is that I am working with string data (both the "titles" and "authors" are strings), and I have to make sure that I pick the right book by the right author from the search results. Unlike integers, strings can vary in many small ways (capitalization, punctuation, spacing, abbreviations), which makes an exact comparison unreliable. This makes it much harder to generalize the script.
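To illustrate the string-comparison problem described above, here is a minimal sketch of one possible approach: normalize both strings, then compare them with a similarity ratio instead of `==`. It uses only Python's standard library (`difflib`, `unicodedata`, `re`). The function names and the 0.85 threshold are my own assumptions for illustration, not part of the actual script, and the threshold would need tuning against real search results.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or space with a space
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Return a ratio between 0.0 and 1.0 for the two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_same_book(title, author, result_title, result_author, threshold=0.85):
    """Hypothetical check: both title and author must be close enough."""
    return (
        similarity(title, result_title) >= threshold
        and similarity(author, result_author) >= threshold
    )
```

A check like this tolerates differences in case and punctuation, but it would still miss reordered names such as "Fitzgerald, F. Scott", so it is a starting point rather than a complete solution.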
The best way I found to build this script was to start with a much simpler version, run it, note any issue or exception, solve that exception in a more general way, and then run it again. Once the small version works fine, I add one or two more features and repeat the process. At the time of writing this post, I am at the step where I print out the date of the earliest edition of each book. I ran into an exception caused by the way the website is laid out, so I will have to fix it. Currently, the script goes to a page that lists all the available editions of a book in order to find the earliest one. But some books have only one edition, so the page I am trying to access does not exist, and I get an error. I will add a condition to handle this case.
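The condition for single-edition books could look something like the sketch below. It leaves out the Selenium part entirely and assumes the date strings have already been pulled from the page as plain text. The function names, the `single_record_date` parameter, and the range of years accepted (1500 to 2099) are all assumptions made for this example.

```python
import re

# Four-digit years from 1500 to 2099, not embedded in a longer number.
# This also matches forms such as "c1925" or "[1925]".
YEAR_PATTERN = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def extract_years(text):
    """Return every plausible publication year found in a string."""
    return [int(year) for year in YEAR_PATTERN.findall(text)]


def earliest_year(edition_dates, single_record_date=None):
    """Return the earliest year, or None if no year can be found.

    edition_dates: date strings taken from the editions list (may be empty
    when the book has only one edition and the list page does not exist).
    single_record_date: date string from the book's own record, used as a
    fallback in that case.
    """
    years = [y for text in edition_dates for y in extract_years(text)]
    if not years and single_record_date:
        years = extract_years(single_record_date)
    return min(years) if years else None
```

Returning `None` instead of raising an error lets the script record the book as "date not found" and move on to the next title, which matters when looping over thousands of books.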
While thinking about the new condition I need to create, I realized that I could improve the way I feed the data into the search box. Instead of searching only for the title of the book and then finding the one by the correct author in the results, I can feed in "title by author", which should narrow the results down to the best match. I will still need a few more days to complete this script. It is unlikely that I will be able to create a perfect general script that works for all 17 thousand books, but I will surely create one that allows me to collect as much data as possible.
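Building the "title by author" search string is simple, but a small helper keeps it consistent across all the books. The sketch below is only an illustration with made-up function names. It also keeps the original title-only search as a second query to try, which is my own assumption about how a fallback might work if the combined query returns nothing.

```python
def build_query(title, author):
    """Combine title and author into a single search string."""
    # Collapse stray whitespace left over from the source data
    title = " ".join(title.split())
    author = " ".join(author.split())
    return f"{title} by {author}" if author else title


def candidate_queries(title, author):
    """Return the queries to try, most specific first, without duplicates."""
    queries = [build_query(title, author), build_query(title, "")]
    unique = []
    for query in queries:
        if query and query not in unique:
            unique.append(query)
    return unique
```

Each string returned by `candidate_queries` would be typed into the search bar in turn until one of them gives a usable result.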
