Being a data scientist and one of the contributors to AnalyticsVidhya, I am an active follower of the website, which is a great place for data science discussions, articles and hacks. Last weekend, while looking for project ideas for a weekend mini-hack, I decided to build something around AV itself.
Most modern-day websites are data repositories in themselves. Article text, comments, users, authors and post tags all comprise valuable data that can be used for correlations, predictions and forecasting.
For example, The New York Times released a report on the effect of weather on crops and its impact on total losses by analysing its own news data.
I started building a story based on AV's articles, authors, comments and tags. The plan: obtain the complete data from the website, apply cleaning, analysis and modelling techniques, surface trends and insights, and visualize them.
Data Collection - A data collection system was required to obtain the complete data from the website and store it at a remote location. I created a Python module comprising seeding (for obtaining links/URLs), crawling (for obtaining the source code of those pages) and parsing (for extracting the relevant information) classes. I also used various text cleaning and NLP techniques to derive extra variables from the article text, such as word count, sentence count, noun usage and verb usage. The sketch below shows roughly how the pieces fit together.
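Here is a minimal sketch of the seed/crawl/parse pipeline, not the production module: the listing URL and the CSS selector are placeholders for whatever the site's real markup is, and the noun/verb counts would come from an NLP tagger such as nltk.pos_tag.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.analyticsvidhya.com/blog/"  # assumed listing entry point

def seed(page_limit=3):
    """Seeding: collect article URLs from the blog's paginated listing."""
    urls = []
    for page in range(1, page_limit + 1):
        html = requests.get(f"{BASE_URL}page/{page}/", timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # Placeholder selector: assumes article titles are links inside <h2> tags.
        urls += [a["href"] for a in soup.select("h2 a[href]")]
    return urls

def crawl(url):
    """Crawling: fetch the raw page source for one seeded URL."""
    return requests.get(url, timeout=10).text

def parse(html):
    """Parsing: extract the relevant fields and derive extra text variables."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True)
    return {
        "title": soup.title.string if soup.title else None,
        "text": text,
        "num_words": len(text.split()),
        # Naive sentence count via terminal punctuation.
        "num_sentences": sum(text.count(p) for p in ".!?"),
    }
```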
To run data collection jobs in parallel and save time, I parallelized the system using Celery as an asynchronous task queue, with Redis as the message broker (sketched below).
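A minimal version of that setup, assuming Redis on its default local port and reusing the seed/crawl/parse functions above (collector is a hypothetical module name):

```python
# tasks.py
from celery import Celery
from pymongo import MongoClient

from collector import seed, crawl, parse  # hypothetical module holding the sketch above

app = Celery("av_scraper", broker="redis://localhost:6379/0")

@app.task
def fetch_and_store(url):
    """One unit of work: fetch a page, parse it, push the record to MongoDB."""
    record = parse(crawl(url))
    record["url"] = url
    MongoClient()["av"]["articles"].insert_one(record)

if __name__ == "__main__":
    # Enqueue one task per seeded URL; workers drain the queue in parallel:
    #   celery -A tasks worker --concurrency=8
    for url in seed():
        fetch_and_store.delay(url)
```

Each URL becomes an independent task, so throughput scales with the number of workers rather than with a single process.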
A complete run took about six minutes and captured 450+ articles and roughly 10K comments, which were pushed to MongoDB hosted on a micro EC2 instance.
Exploratory Data Analysis - MongoDB supports aggregation queries, which are lightweight and easy to use, so as a standalone tool it works well for EDA. I extracted a number of aggregated stats, from simple queries to complex ones such as "year-wise, author-wise average word usage per article"; a sketch of that aggregation follows.
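For instance, that stat maps onto a single aggregation pipeline. Field names like author, published_at and num_words are assumptions about the stored schema, and $year requires published_at to be a BSON date:

```python
from pymongo import MongoClient

articles = MongoClient()["av"]["articles"]

pipeline = [
    {"$group": {
        "_id": {"year": {"$year": "$published_at"}, "author": "$author"},
        "avg_words": {"$avg": "$num_words"},  # mean word count per article
        "articles": {"$sum": 1},
    }},
    {"$sort": {"_id.year": 1, "avg_words": -1}},
]

for row in articles.aggregate(pipeline):
    print(row["_id"]["year"], row["_id"]["author"], round(row["avg_words"]))
```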
REST API - Since the complete data lived in remote storage, I created a REST API in Flask so the data could be sliced and diced as required. The API can push and pull data with filters such as aggregations and grouping; a minimal endpoint is sketched below. The API is hosted at the link at the end of this post.
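A minimal sketch of one such endpoint; the route name and filter parameter are illustrative, not the hosted API's actual contract:

```python
from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
articles = MongoClient()["av"]["articles"]

@app.route("/articles")
def get_articles():
    """Pull articles, optionally filtered, e.g. /articles?author=SomeAuthor."""
    query = {}
    if "author" in request.args:
        query["author"] = request.args["author"]
    docs = articles.find(query, {"_id": 0, "title": 1, "author": 1, "num_words": 1})
    return jsonify(list(docs))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```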
Data Visualizations - There are a number of data visualization libraries out there, such as D3.js and Google Charts. My personal favourite is D3, but for quick-and-dirty hacks Google Charts is an awesome pick: it is flexible, easy to use and provides a variety of customizations. Google Charts expects a particular data format for each visualization, so I wrote a wrapper over the Mongo EDA query results to produce the desired format (sketched below).
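A minimal version of such a wrapper: Google Charts' DataTable JSON wants a "cols" list of typed columns and a "rows" list of cell arrays, so the wrapper just reshapes each aggregation row. The example data below is made up:

```python
def to_datatable(agg_rows, columns):
    """Reshape Mongo aggregation output into Google Charts DataTable JSON:
    {"cols": [{"label": ..., "type": ...}], "rows": [{"c": [{"v": ...}, ...]}]}."""
    return {
        "cols": [{"label": label, "type": ctype} for _key, label, ctype in columns],
        "rows": [
            {"c": [{"v": row.get(key)} for key, _label, _ctype in columns]}
            for row in agg_rows
        ],
    }

# Illustrative usage with made-up numbers; the result can be passed straight
# to new google.visualization.DataTable(...) on the JavaScript side.
table = to_datatable(
    [{"year": 2014, "avg_words": 980}, {"year": 2015, "avg_words": 1240}],
    [("year", "Year", "number"), ("avg_words", "Avg words per article", "number")],
)
```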
An interesting finding from this analysis was a rising trend in Machine Learning articles on the website over time. Also, more articles were published at the start of the week than on weekends. Here are a few insights from the analysis results:
Here is the link to the complete analysis. The beauty of this analysis was not only the insights but also the generic data mining, cleaning, parsing and database-querying engine, which is extendable to many other analyses. Feel free to discuss the approach in more detail or share your views in the comments.
API: http://52.74.204.17:5000/
Insights: http://52.74.204.17:5000/insights