A NEW TEXT PREPROCESSING METHOD
I was tasked with cleaning the data and keeping only the books whose content is of value to us. Since we are focusing on laws during the Romanticism period, we only needed books that mention the topic. The way we cleaned the data was this: we take one book, split it into words, and check whether the word "law" appears in it. If it does, we collect the 500 words before and after each occurrence and work only with that portion of text. We repeat this process for every one of our 1,400+ books. Since some books do not discuss our topics, those books get filtered out along the way. This process gives us clean and much more relevant data.
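The windowing step could be sketched roughly like this; the function name, the simple punctuation stripping, and the toy example are my own assumptions, not the project's actual code, though the 500-word window matches what we used.

```python
def law_windows(text, keyword="law", window=500):
    """Return snippets of `window` words on each side of every
    occurrence of `keyword` (a hypothetical helper, for illustration)."""
    words = text.lower().split()
    snippets = []
    for i, w in enumerate(words):
        # crude punctuation stripping so "law," and "law." still match
        if w.strip('.,;:"!?') == keyword:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            snippets.append(" ".join(words[start:end]))
    return snippets

# A book that never mentions the keyword yields an empty list,
# which is how off-topic books end up filtered out.
```

Running each book through a function like this and dropping the ones that return nothing reproduces the filtering behavior described above.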
After the data cleaning, we feed the text portions into Gensim for LDA topic modeling. The results so far are satisfactory, but my supervisor still has to review them before we move to the next step.
This is the first time I have ever used this method of data preprocessing. I had always used all the data I collected without worrying about it being biased or off-topic. As a result, my NLP results were never as good as I expected them to be. I will try to apply this method to my future projects, especially when I collect raw data from various sources where the topics are not always the same.