Posts

Showing posts from June, 2022

DATA PROCESSING

Today, I manipulated data that I would consider fairly big for the first time: the collection of all Project Gutenberg books, which has a total size of 31GB. My task is to process the books in order to create metadata for Project Gutenberg, including the full text, title, and author of each book. When I wrote the code, I wrote it in full and debugged it on the go whenever an issue occurred. This is how I usually work, but it was not the best method this time. Due to the large amount of data, it takes several hours for a task to complete, so running everything and then stopping midway to debug was time-consuming. Sometimes my script ran for hours before I realized that something was not working, so I had to stop the script, edit the code, then run it from the beginning again.

It took me about a week to complete the task, and I am still running the final version of the code while writing this post. It made me realize that with such data, it is much...
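For context, here is a rough sketch (not my exact script) of what pulling the title and author out of a single plain-text book can look like, assuming the file keeps the usual "Title:" and "Author:" header lines and the "*** START OF ***" marker; the file path in the usage comment is made up.

```python
import re

def extract_metadata(path):
    """Pull title, author, and body text from a Project Gutenberg plain-text file.

    Assumes the usual header lines ("Title: ...", "Author: ...") and the
    "*** START OF ... ***" marker; files that deviate need special handling.
    """
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()

    title_match = re.search(r"^Title:\s*(.+)$", text, re.MULTILINE)
    author_match = re.search(r"^Author:\s*(.+)$", text, re.MULTILINE)

    # Keep only the body after the start-of-book marker, if it is present.
    parts = re.split(r"\*\*\* ?START OF.*?\*\*\*", text, maxsplit=1)
    body = parts[1] if len(parts) > 1 else text

    return {
        "title": title_match.group(1).strip() if title_match else None,
        "author": author_match.group(1).strip() if author_match else None,
        "text": body.strip(),
    }

# Hypothetical usage on one downloaded book:
# record = extract_metadata("books/pg1342.txt")
```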

ONLINE RESOURCE VS OWN SCRIPT

On Wednesday, I was tasked with creating metadata for the entire Project Gutenberg corpus, containing the full book text, the author, and the title. This metadata will be saved online, we will get our portion of data from it, and it may also be useful to other researchers in the future. I found a project on GitHub that attempted to collect and process the Project Gutenberg books, but the code is broken in several ways: it only collects half of the data, it only processes a very small portion of it, and its per-language filter does not work. Because of this, the program is practically useless, and I have to collect the data myself. Given the large size of the corpus, and to avoid saturating Project Gutenberg's servers, there is a small delay after each request, so it takes a few days to collect a total of 68,000 books. I am now on my third day of downloading and I have only gotten 46,000 books.

It made me realize that although t...
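The download loop itself is simple; here is a minimal sketch of the idea, with a placeholder URL pattern and an arbitrary delay value rather than my exact script.

```python
import os
import time
import requests

BASE_URL = "https://www.gutenberg.org/cache/epub/{id}/pg{id}.txt"  # placeholder pattern
DELAY_SECONDS = 2  # small pause between requests to avoid saturating the server

def download_books(book_ids, out_dir="books"):
    """Fetch each book by numeric ID, saving it locally, with a delay between requests."""
    os.makedirs(out_dir, exist_ok=True)
    for book_id in book_ids:
        url = BASE_URL.format(id=book_id)
        response = requests.get(url, timeout=30)
        if response.ok:
            with open(f"{out_dir}/pg{book_id}.txt", "w", encoding="utf-8") as f:
                f.write(response.text)
        # Wait before the next request regardless of success.
        time.sleep(DELAY_SECONDS)
```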

DATA COLLECTION COMPLETE

After five weeks of intense work, I finally managed to collect the required data for our research project. We attempted multiple methods, but each time we ran into issues that forced us to switch to a new approach. Our final method involves collecting the names of authors who published their works during the Romantic era, the era we are targeting, and then collecting those authors' works from Project Gutenberg. We concluded that, although this method is not 100% accurate, it is the most efficient among the methods we have tried.

The past five weeks made me realize what the most important aspect of data science is: data collection. We spent half of my internship collecting the data and making sure that we get the best quality possible (in this case, the most data that match our targeted period). The second half will be spent processing the data. While we use pipelines such as Gensim to...

SQL vs PYTHON

The new data collection process was fruitful compared to the previous ones. This latest approach made me realize that the main goal is to get the books from a particular era, not to get the publication date of each book. As a reminder, our latest method is based on a list of authors that one of my supervisors provided: a metadata table listing multiple authors and the period during which they published their work. The period we are looking for is the Romantic period, from 1780 to 1837. So I started with that list, filtered the authors who published their works during that period, and created a new metadata table for these authors. Then we take this new author metadata and the Project Gutenberg metadata and join them on the authors' names. In the end we get a new metadata table containing the British authors and the books they published during the Romantic period.

All the tasks described above wa...
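In PySpark terms, the filter-and-join looks roughly like the sketch below; the table and column names are placeholders I made up for illustration, not the ones we actually use.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed layout: the author list has an `author` column plus the bounds of the
# period they published in; the Project Gutenberg metadata has `author` and `title`.
authors = spark.table("author_periods")        # assumed table name
gutenberg = spark.table("gutenberg_metadata")  # assumed table name

# Keep authors whose publishing period overlaps the Romantic period (1780-1837).
romantic_authors = authors.filter(
    (authors.period_start <= 1837) & (authors.period_end >= 1780)
)

# Join against the Gutenberg metadata on the author name.
romantic_books = (
    romantic_authors.join(gutenberg, on="author", how="inner")
    .select("author", "title")
)

# Persist the joined result as the new metadata table.
romantic_books.write.saveAsTable("romantic_era_books")
```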

STEP BY STEP LEARNING

A few days ago, I shifted my work environment from my local machine to a cloud platform called Databricks, a platform for cluster computing with Spark. For the moment, my main task revolves around SQL queries, as we work with various tables to create one full metadata table. This part of the internship is what I find the most exciting, as we are working on a big data platform. Since I intend to pursue a career in data science, experience with big data is very valuable, especially with SQL and Spark. The cloud platform also makes it easier to work in a team and makes my work more interactive, allowing my supervisors to inspect it.

What I have realized about my job so far is that almost all the skills I apply to my work come from personal experience, not from any class I have taken in college. It may be because I just finished freshman year and have only taken two computer science classes so far. Due to the intensity of the wo...

A BUSY WEEK

This week was one of the busiest of my internship so far. I had to finalize my script quickly so that I could present it to my supervisor. When I finally presented my work, we consulted a librarian to verify its accuracy, and I am happy that the librarian approved my work as the best way to gather the data from the Library of Congress. However, when I discussed the next step with my supervisors, we found an even better way to gather the most accurate data. This time it does not involve collecting data from a third-party website or scraping anything. Instead, my supervisors have a list of authors along with all their information, including their era. The era we are looking for is the Romantic era, which goes from 1780 to 1838. So this time, what we will do is work with the authors and collect those who match the era we are interested in. Once we have all the authors, we match them with our Project Gutenberg metadata and only ...

THE SCRIPT

I finally managed to create a script that works. It is an implementation of CQL in Python. I spent many days working on this script and encountered error after error. I have a total of 17 thousand books for which I have to find the country and region of publication. I am using the Library of Congress to collect the data through Search/Retrieve via URL (SRU) and its Contextual Query Language (CQL). With data this large it is hard to debug, as I never know when an issue may occur. Also, CQL does not have a big online community, so finding solutions online was hard; I had to spend at least two days fixing a single bug. The biggest issue I had was a title not being accepted by CQL even though the title was in string format. After spending hours figuring out why a string is not considered a string, I found out that there is a difference between a string and a string literal; I do not even fully understand the difference, but at least I found out that python ...
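As far as I understand it, the fix comes down to passing the title as a quoted CQL string literal instead of a bare term. Here is a small sketch of that idea; the `dc.title` index name is an assumption, not necessarily the index my script uses.

```python
def cql_title_clause(title):
    """Wrap a book title as a CQL string literal.

    CQL treats a bare multi-word term differently from a quoted string literal,
    so titles with spaces or punctuation should be double-quoted and any
    embedded double quotes escaped.
    """
    escaped = title.replace('"', '\\"')
    return f'dc.title = "{escaped}"'  # index name assumed

# Example: cql_title_clause('The "Annotated" Frankenstein')
# produces: dc.title = "The \"Annotated\" Frankenstein"
```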

CHALLENGES WITH DEBUGGING

I am in the process of writing a script to scrape the data from the Library of Congress. I am using the query language called SRU, but I have to combine it with Python for the loop. I am using a module created by an individual developer that allows me to use SRU with Python. However, I am currently facing multiple issues with the module. The main error I am getting is a server error which prevents me from retrieving any data. So far, I cannot pinpoint the origin of the issue, but I know that it only happens during the loop: if I isolate the problem by making a single request with the same keywords, I do not get the error.

With an issue like this, I can only rely on my past experience and try to debug the script as best I can. This is easier said than done, as SRU is not a very popular query language, so finding support for it is hard; I cannot rely on resources such as Stack Overflow or other online forums. This is going to be the first...
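One thing I plan to try, sketched below under the assumption that the server errors come from firing requests too quickly inside the loop, is to pause between iterations and retry failed requests. Here, `sru_search` is a stand-in for whatever request function the module exposes, not its real name.

```python
import time

def search_with_retry(sru_search, query, retries=3, delay=5):
    """Call an SRU search function, retrying after a pause on server errors.

    `sru_search` is a placeholder for the module's request function; the
    exception type worth catching depends on that module.
    """
    for attempt in range(retries):
        try:
            return sru_search(query)
        except Exception as exc:  # the module's own error class would be more precise
            print(f"attempt {attempt + 1} failed: {exc}")
            time.sleep(delay)
    return None

# In the main loop, a pause between books keeps the request rate down:
# for title in titles:
#     result = search_with_retry(sru_search, title)
#     time.sleep(1)
```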

I LEARNED SOMETHING NEW

Today, I learned something very interesting about the world of libraries. As I mentioned in my last blog post, I had to come up with a new method to collect the year and region of publication of the books we need for the project. One of my supervisors suggested I use data from the Library of Congress. It is the national library of the United States, so the data on it is reliable, and we are more than likely to find all the information we need there. The Library of Congress does not have an API, but there is a way to query its data using the query language called Search/Retrieve via URL (SRU). A query works like this: you start with the base URL of the server you want the data from, add the keyword you want to search for, then the limit on the number of results you want to see, so the whole query takes the form of a long URL. When a query is made, the result comes back in the browser in XML format.

The result of the query is very i...
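Here is a sketch of what building such a query looks like from Python, using the standard SRU parameters (version, operation, query, maximumRecords). The base URL below is illustrative, not the actual endpoint I am querying.

```python
import requests
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Illustrative base URL; the real SRU endpoint differs.
BASE_URL = "https://example-sru-server.loc.gov/sru"

def sru_search(query, max_records=10):
    """Build an SRU searchRetrieve URL, fetch it, and parse the XML response."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": query,               # a CQL query, e.g. 'dc.title = "Frankenstein"'
        "maximumRecords": max_records,
    }
    url = f"{BASE_URL}?{urlencode(params)}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # The whole result comes back as XML, so parse it into an element tree.
    return ET.fromstring(response.content)

# root = sru_search('dc.title = "Frankenstein"')
# for element in root.iter():
#     print(element.tag, element.text)
```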

BACK TO SQUARE ONE

When terms and conditions are meant to be read

After completing a fully working version of the WorldCat crawler and presenting it to my supervisor, I was informed that my script could not be used and that I had to come up with another method to collect the data. As a reminder, I found out that WorldCat has most of the data we are looking for. It does have an API, but it is paid and can only be accessed by an organization or a company. So I decided to crawl the website instead to extract the data we need. Today, I presented the result to my supervisor and was informed that WorldCat's policies prevent us from using it: its terms and conditions state that scraping, crawling, and accessing the website with bots are prohibited. I only realized this when my supervisor pointed it out.

Before this research project, I had scraped many websites to collect data for personal and school projects, so I never considered ...

SCRAPING WORLDCAT

The final step for getting the books' publication dates

Yesterday, I started writing the script to scrape WorldCat using Python and Selenium. This task is particularly tough and time-consuming; I have spent more than 15 hours working on it and it is still not finished. I am trying to get the publication date of each of the 17+ thousand books. My method is to pass the title of a book to the search bar, go through the results, find the earliest edition, then extract the publication date. The tricky part is that I am working with string data (both the titles and the authors are strings), but I have to make sure that I get the right book by the right author from the search results. Unlike an integer, a string can appear with many small variations, which makes an exact comparison hard. Therefore, it is much harder to generalize the script.

The best option I had to create this script was by creating a ...
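Here is a sketch of the kind of normalization and fuzzy comparison this calls for, using only the standard library; the 0.9 threshold is a guess rather than a value I have actually tuned.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def same_work(title_a, title_b, author_a, author_b, threshold=0.9):
    """Decide whether two title/author pairs likely refer to the same book.

    The threshold is arbitrary and would need tuning against real search results.
    """
    title_score = SequenceMatcher(None, normalize(title_a), normalize(title_b)).ratio()
    author_score = SequenceMatcher(None, normalize(author_a), normalize(author_b)).ratio()
    return title_score >= threshold and author_score >= threshold

# Hypothetical usage against one search result:
# same_work("Pride and Prejudice", "Pride and Prejudice: A Novel", "Jane Austen", "Austen, Jane")
```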

PROGRESS IN DATA COLLECTION

The data collection process

Today, I had my first meeting with the other fellow students and professors. During the meeting, I had a chance to chat with a professor about the work I am doing and the stage I am currently at. The professor told me about his interest in the same data and how unsuccessful he has been with it so far. I got to share some of the methods I have used, my results so far, and the new method I am planning to try. This is now my third straight blog post about the same step: collecting books published in the U.K. during a specific period of time. This is a very tricky part of the project, as there is no good way to collect the data. Project Gutenberg has the books, but not their country or year of publication, so I selected only the books written in English, and I have to rely on other sources to figure out when and where each book was published.

After exploring all the possible opti...
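The English-only filter itself is the easy part; a minimal sketch, assuming the catalog has been loaded into a pandas DataFrame with a "Language" column (the file and column names here are assumptions about the layout, not necessarily what I use):

```python
import pandas as pd

# Assumed layout: a catalog CSV with one row per book and a language column.
catalog = pd.read_csv("pg_catalog.csv")

# Keep only English-language books; country and year of publication are not in
# the catalog, so those still have to come from other sources.
english_books = catalog[catalog["Language"] == "en"]
print(len(english_books), "English books")
```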