The final step for getting the books' publication dates

Yesterday, I started writing the script to scrape WorldCat using Python and Selenium. This task is particularly tough and time-consuming: I have spent more than 15 hours on it and it is still not finished. The goal is to get the publication date of each of the more than 17 thousand books. My method is to pass the title of a book to the search bar, go through the results, find the earliest edition, then extract its publication date. The tricky part is that I am working with string data (both the titles and the authors are strings), and I have to make sure that I pick the right book by the right author from the search results. Unlike integers, strings can vary in many small ways, such as capitalization or punctuation, which makes an exact comparison unreliable. That makes it much harder to generalize the script.
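To illustrate the string-matching problem, here is a minimal sketch of one way to compare a title and author against a search result without requiring an exact match. It is not my actual script: the function names (`normalize`, `author_key`, `is_same_book`) and the 0.85 threshold are assumptions made up for this example, and it only uses Python's standard library.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def author_key(name):
    """Sort the name parts so 'Austen, Jane' and 'Jane Austen' compare equal."""
    return " ".join(sorted(normalize(name).split()))


def similarity(a, b):
    """Return a ratio between 0 and 1 describing how alike two strings are."""
    return SequenceMatcher(None, a, b).ratio()


def is_same_book(title, author, found_title, found_author, threshold=0.85):
    """Accept a search result only if both title and author are close enough.

    The threshold is an arbitrary starting point, not a tuned value.
    """
    title_score = similarity(normalize(title), normalize(found_title))
    author_score = similarity(author_key(author), author_key(found_author))
    return title_score >= threshold and author_score >= threshold


if __name__ == "__main__":
    print(is_same_book("Pride and Prejudice", "Jane Austen",
                       "Pride and prejudice.", "Austen, Jane"))
    print(is_same_book("Emma", "Jane Austen", "Emma", "Charlotte Bronte"))
```

A fuzzy comparison like this trades some precision for coverage: a lower threshold matches more books but also lets more wrong editions through, so the value would need checking against a sample of real results.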

The best way I found to build this script was to start with a much simpler version, run it, note any issue or exception, solve that exception in a more general way, then run it again. Once the small version works fine, I add one or two more features and repeat the process. At the time of writing this post, I am at the step where I print out the date of the earliest edition of each book. I ran into an exception caused by the way the website is laid out, so I will have to fix it. My current script goes to a page that lists all the available editions of a book in order to find the earliest one. But some books have only one edition, so the page I am supposed to access does not exist, and I get an error. I will add a condition to handle this case.
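The condition for single-edition books could look something like the sketch below. It leaves out the Selenium part and only shows the decision logic, assuming the date strings have already been read from the page. The names `extract_year` and `earliest_year`, and the idea of passing the date from the book's own record as a fallback, are assumptions for illustration.

```python
import re

# Matches a plausible four-digit publication year (1400-2099).
YEAR_PATTERN = re.compile(r"\b(1[4-9]\d{2}|20\d{2})\b")


def extract_year(text):
    """Return the first four-digit year found in the text, or None."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None


def earliest_year(edition_dates, single_record_date=None):
    """Pick the earliest year from the editions list.

    If the editions list is empty (the book has only one edition, so there
    is no editions page), fall back to the date on the book's own record.
    """
    years = [extract_year(date) for date in edition_dates]
    years = [year for year in years if year is not None]
    if years:
        return min(years)
    if single_record_date is not None:
        return extract_year(single_record_date)
    return None


if __name__ == "__main__":
    print(earliest_year(["2003", "Print book: 1987", "[1992]"]))
    print(earliest_year([], single_record_date="©1998"))
```

On the Selenium side, one way to avoid the error in the first place is to look up the editions link with `find_elements`, which returns an empty list when nothing matches instead of raising an exception.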

While thinking about the new condition I need to create, I realized that I could improve the way I feed the data into the search box: instead of searching only for the title of the book and then picking the one by the correct author from the results, I can search for "title by author", which should return the best match first. I will still need a few more days to complete this script. It is unlikely that I will be able to create a perfect general script that works for all 17 thousand books, but I will surely create one that allows me to collect as much data as possible.
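Building the "title by author" search text is simple, but it helps to clean up stray whitespace and to handle books with no known author. A small sketch, where `build_query` is a hypothetical helper name and not part of any library:

```python
from urllib.parse import quote_plus


def build_query(title, author):
    """Combine title and author into one search string.

    Extra whitespace is collapsed, and the 'by author' part is skipped
    when the author is missing.
    """
    title = " ".join(title.split())
    author = " ".join(author.split())
    return f"{title} by {author}" if author else title


if __name__ == "__main__":
    query = build_query("  Pride and  Prejudice ", "Jane Austen")
    print(query)
    # URL-encoded form, in case the query is placed in a URL
    # rather than typed into the search box.
    print(quote_plus(query))
```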

MY FIRST ACADEMIC RESEARCH JOURNEY

SCRAPING WORLDCAT
