In the past I had used NLTK and Python to solve this kind of problem, but neural networks have proven to be more accurate when it comes to NLP. I researched text classification libraries and different approaches to this problem and decided to use a CNN (convolutional neural network). I used Denny Britz's code for implementing the CNN.

Below I describe the files and the procedure I followed to get the data, train the model, test the model, and the results.

First, I went to the leading newspaper The Guardian and looked for the labels, i.e. Finance, Law, Fashion, Lifestyle. Scraping the data from the same source helps keep the articles homogeneous. I used Goose and BeautifulSoup to scrape the articles. The code for this is uploaded on GitHub.

The folder structure and a description of the data files are as follows:

Raw_data/                      Contains files related to train and test
├── collect_url_data.py        Python script that scrapes the articles
├── fashion_7000.txt           7000 training samples for class fashion
├── finance_7000.txt           7000 training samples for class finance
├── law_7000.txt               7000 training samples for class law
├── lifestyle_7000.txt         7000 training samples for class lifestyle
├── fashion                    Original scraped data and the cleaned version
│   ├── fashion_original.txt   Original scraped data
│   └── test_fashion.txt       Test data for fashion, 1001 samples
├── finance                    Original scraped data and the cleaned version
│   ├── finance_urls.txt       URLs scraped for finance
│   ├── original_finance.txt   Original scraped file
│   └── test_finance.txt       Test samples for finance
├── law
│   ├── original_law.txt       Original scraped data for law
│   └── law_7000.txt           7000 training samples for law
└── lifestyle
    ├── lifestyle_urls.txt     URLs collected for scraping
    ├── lifestyle.txt          Cleaned data for lifestyle
    ├── lifestyle_7000.txt     7000 training samples for lifestyle
    └── log_lifestyle          Log output of the script
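To make the scraping step concrete, here is a minimal sketch of how a script like collect_url_data.py could work with BeautifulSoup and Goose. This is not the author's actual code; the function names, the link-filtering rule, and the base URL are my assumptions.

```python
# Hypothetical sketch of a URL-collection + article-extraction script.
# Assumes the beautifulsoup4 and goose3 packages are installed; the
# base URL and link filter are illustrative assumptions, not the
# author's actual logic.
from bs4 import BeautifulSoup


def collect_article_urls(section_html, base="https://www.theguardian.com"):
    """Collect on-site article links from a section page's HTML."""
    soup = BeautifulSoup(section_html, "html.parser")
    urls = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Keep only absolute links pointing back to the same site,
        # so all articles come from one homogeneous source.
        if href.startswith(base):
            urls.add(href)
    return sorted(urls)


def extract_article_text(url):
    """Fetch one article and return its cleaned body text via Goose."""
    from goose3 import Goose  # imported lazily; requires network to use
    g = Goose()
    article = g.extract(url=url)
    return article.cleaned_text
```

The collected URLs would then be written to a file such as lifestyle_urls.txt, and the extracted texts accumulated into the per-class training files.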