DATA COLLECTION COMPLETE
After five weeks of intense work, I finally managed to collect the data required for our research project. We tried several methods, but each one ran into problems that forced us to switch to a new approach. Our final method starts by collecting the names of authors who published their works during the Romantic era, the period we are targeting. We then took these authors and collected their works from Project Gutenberg. Although this method is not 100% accurate, we concluded that it is the most efficient of the methods we tried.
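To make the author-based approach concrete, here is a minimal sketch of the filtering step. It is not our actual collection code: the author list, the sample rows, and the catalog layout (columns `Text#`, `Title`, `Authors`, with multiple authors separated by `; `) are all assumptions for illustration.

```python
import csv
import io

# Hypothetical author list; the real list of Romantic-era authors is not shown here.
ROMANTIC_AUTHORS = {"Shelley, Mary Wollstonecraft", "Keats, John"}

# Illustrative rows in an assumed catalog layout (column names are an assumption).
SAMPLE_CATALOG = """Text#,Title,Authors
84,Frankenstein,"Shelley, Mary Wollstonecraft, 1797-1851"
2701,Moby Dick,"Melville, Herman, 1819-1891"
"""

def works_by_authors(catalog_file, authors):
    """Return (id, title) pairs for rows whose author field names a listed author."""
    matches = []
    for row in csv.DictReader(catalog_file):
        # Multiple authors are assumed to be separated by "; ".
        row_authors = row["Authors"].split("; ")
        # Prefix match, so trailing birth and death years are ignored.
        if any(a.startswith(name) for a in row_authors for name in authors):
            matches.append((int(row["Text#"]), row["Title"]))
    return matches

selected = works_by_authors(io.StringIO(SAMPLE_CATALOG), ROMANTIC_AUTHORS)
print(selected)
```

Matching on names alone is one reason a method like this cannot be 100% accurate: an author can be listed under a different spelling, or can have published works outside the period.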
The past five weeks made me realize that the most important aspect of data science is data collection. We spent the first half of my internship collecting the data and making sure we got the best quality possible (in this case, as much data as possible that matches our targeted period). The second half will be spent processing the data. While we can use libraries such as Gensim for topic modeling, or NLTK and spaCy for preprocessing, we could not use any prebuilt tool to collect the data, so we had to build everything from scratch. I realized that even though my interest is in data science, I also need the skills of a data engineer and a data analyst to analyze the data efficiently and judge its quality.