BUS5AP Analytics in Practice

Published: 25-Sep-2021

Question:

We assume that not everyone is familiar with the environment or has an account on Kaggle. Therefore, the first step is to set ourselves up on Kaggle.
1. Go to Kaggle.com and create an account. There are various ways to create an account but for the purpose of this subject, please create an account that shows your name (or student ID) clearly.
a. Thinking longer term, make sure this is a professional sign-in. If it is not, create a new email address, as the environment is a reflection of you professionally.

2. Kaggle has evolved to become a self-hosted environment. It runs both R and Python with many pre-built packages installed. This cloud-hosted environment is called a “Kernel” in Kaggle speak.
a. To explore the Kernel environment, click “Kernels” on the top after signing in.
b. Click “New Kernel” in the centre of the screen and you will be asked to select the data set that you want to work on.
c. Depending on the data set, you will be given a number of programming language options.
d. For now, experiment with the environment using different data sets, drawing on what you have learned with R or SQL, for example.

3. Kaggle also hosts a number of good tutorials, which you can access here https://www.kaggle.com/wiki/Tutorials
4. As a community site, you need to be aware of what is public and what is hidden. Kernels, data sets, competitions, etc., can be either public or private to invited individuals. Have a look around and make sure you understand your privacy settings.
5. Kaggle has an option called “Kaggle in class”, which is used primarily for teaching and learning purposes. This will be the environment used for subsequent exercises, as it gives us a more controlled setting for the classroom and for assignments.
a. If you would like to explore this further, you can do so by going to https://inclass.kaggle.com.
b. A few small data sets that make good starting points are already pre-loaded for you to explore.

Answer:

The aim of this analysis is to predict the success or failure of a Kickstarter campaign. It is a multi-class classification problem with five levels: canceled, failed, live, successful and suspended. Knowing the likely outcome of a campaign before the project launches is extremely important, so this analysis is a vital input when setting the funding goal.

Preliminary Analysis:

Each project in the dataset has one of the five statuses mentioned above. The project id and url columns do not seem to have any effect on model building, so they have been removed from both the training and test datasets. The levels column has been converted into a factor in both the training and test datasets.
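As a rough illustration of this clean-up, the Python sketch below drops the identifier and URL columns and turns a column into integer category codes, which is similar in spirit to converting it to a factor in R. The column names (`project_id`, `url`, `goal`, `levels`) and the sample rows are assumptions made for the example, not the actual dataset schema.

```python
# Hypothetical column names and rows; the real dataset schema may differ.
DROP_COLUMNS = {"project_id", "url"}


def drop_columns(rows, columns=DROP_COLUMNS):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


def to_factor(values):
    """Map each distinct value to an integer code, similar to a factor in R."""
    levels = sorted(set(values))
    code = {level: i for i, level in enumerate(levels)}
    return [code[v] for v in values], levels


sample = [
    {"project_id": 1, "url": "https://example.com/a", "goal": 5000, "levels": 7},
    {"project_id": 2, "url": "https://example.com/b", "goal": 12000, "levels": 3},
    {"project_id": 3, "url": "https://example.com/c", "goal": 800, "levels": 7},
]
cleaned = drop_columns(sample)
codes, levels = to_factor([row["levels"] for row in cleaned])
print(cleaned[0], codes, levels)
```

In practice the list of levels would be derived once and reused for both the training and test data, so that the same value always receives the same code.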

Exploratory Analysis:

The goal of the successful projects was 3 times higher than the goal of the failed projects.
The pledged amount of successful projects was 10 times higher than that of the failed projects.
The comments of successful projects were 14 times the comments of the failed projects.
The updates of successful projects were 4 times that of the failed ones.
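Comparisons of this kind can be sketched as a ratio of per-status averages. The snippet below assumes the comparison is based on the mean of each column, which the write-up does not state, and it uses made-up rows, so its output is not meant to reproduce the figures above.

```python
from collections import defaultdict
from statistics import mean


def mean_by_status(rows, column):
    """Average the given column separately for each project status."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["status"]].append(row[column])
    return {status: mean(values) for status, values in groups.items()}


def success_to_failure_ratio(rows, column):
    """Ratio of the successful-project mean to the failed-project mean."""
    means = mean_by_status(rows, column)
    return means["successful"] / means["failed"]


# Made-up rows for illustration only.
rows = [
    {"status": "successful", "comments": 10},
    {"status": "successful", "comments": 30},
    {"status": "failed", "comments": 5},
    {"status": "failed", "comments": 15},
]
print(success_to_failure_ratio(rows, "comments"))
```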

Feature Selection

Feature selection is an extremely important aspect of model building, as it can change the accuracy of the model substantially.

Since the data has both categorical and numerical attributes, the method chosen for feature selection is “Boruta” (Analytics Vidhya, 2016).

Boruta gives a clear verdict on the importance of each feature. It follows the all-relevant approach to feature selection, which tries to capture every feature that is relevant to the target variable in at least some circumstances. Traditional feature selection algorithms, by contrast, follow the minimal-optimal approach, which relies on a small subset of features that yields minimal error on a chosen classifier.

Boruta checks all the features and reports which ones are strongly relevant and which are weakly relevant to the target variable. The technique is widely applied in the medical field for this reason.
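The core idea behind Boruta, comparing each real feature against shuffled "shadow" copies, can be sketched in a few lines of Python. This is a simplified illustration and not the Boruta package: it swaps in absolute Pearson correlation as the importance score where Boruta uses random forest importance, and it omits Boruta's statistical test. The function names and toy data are assumptions made for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def shadow_feature_hits(features, target, n_iter=20, seed=0):
    """Count how often each feature beats the best shuffled (shadow) copy."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)  # shuffling breaks any link with the target
            shadow_scores.append(abs(pearson(shadow, target)))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if abs(pearson(values, target)) > threshold:
                hits[name] += 1
    return hits


target = [0, 1] * 20
features = {
    "signal": [0, 1] * 20,       # identical to the target
    "noise": [0, 0, 1, 1] * 10,  # uncorrelated with the target
}
hits = shadow_feature_hits(features, target)
print(hits)
```

A feature that beats the shadows in most iterations would be treated as relevant, and one that rarely does would be rejected.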

Results of Boruta:

According to the Boruta results, the features that are strongly related to the target variable are category, subcategory, goal, levels, duration and location.

Model Building:

Since this is a classification problem, candidate methods include KNN, Random Forest, Decision Tree and SVM. The method used first is a Decision Tree.

A decision tree classifier is a supervised learning method that poses a series of learned questions about the features of the training examples. Each answer leads to a follow-up question, until a final conclusion about the class label is reached.
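The question-and-answer process can be illustrated with a small hand-written tree. In the sketch below the tree structure, feature names and thresholds are all made up for the example; a library such as rpart would learn them from the training data.

```python
# A hand-written toy tree; feature names and thresholds are made up.
TREE = {
    "feature": "goal",
    "threshold": 10000,
    "left": {"label": "successful"},      # goal <= 10000
    "right": {
        "feature": "duration",
        "threshold": 30,
        "left": {"label": "successful"},  # duration <= 30
        "right": {"label": "failed"},     # duration > 30
    },
}


def predict(node, example):
    """Follow the questions from the root until a leaf label is reached."""
    while "label" not in node:
        goes_left = example[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["label"]


print(predict(TREE, {"goal": 20000, "duration": 45}))
```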

After creating a data frame that includes all the important features, a model is built on it using the rpart library.

The accuracy obtained on the training data is 98% and on the test data is 45%. A gap this large suggests that the tree overfits the training data.
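Accuracy here is taken to mean the fraction of projects whose predicted status matches the true status. A minimal Python sketch of that calculation follows; the labels are made up and do not come from the assignment data.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels for illustration only.
actual = ["failed", "successful", "failed", "live"]
predicted = ["failed", "successful", "failed", "failed"]
print(accuracy(predicted, actual))
```

Computing the same measure on the training set and on held-out data is what exposes a gap between the two.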

To improve the results, an SVM model was also applied.

SVM:

A Support Vector Machine (SVM) is an extremely popular discriminative algorithm that is also widely used in text classification problems. SVM tries to find the maximum-margin hyperplane that separates the data by class in a high-dimensional space. Finding this hyperplane is posed as an optimization problem. (Ray, 2017)

SVM helps to control overfitting and copes relatively well with the curse of dimensionality.
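To make the notion of a margin concrete, the sketch below measures how far each point lies from a given hyperplane and reports the smallest distance along with the points that attain it. The hyperplane here is hand-picked for toy data; an SVM would search for the weights `w` and offset `b` that make this smallest distance as large as possible. All names and values are assumptions made for the example.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_and_closest(w, b, points):
    """Return the smallest absolute distance and the points attaining it."""
    distances = [abs(signed_distance(w, b, p)) for p in points]
    margin = min(distances)
    closest = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
    return margin, closest


# Toy 2-D data and a hand-picked separating line x1 = 2 (w = (1, 0), b = -2).
points = [(0.0, 1.0), (1.0, 3.0), (3.0, 0.0), (5.0, 2.0)]
margin, closest = margin_and_closest((1.0, 0.0), -2.0, points)
print(margin, closest)
```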

Tasks Performed By SVM:

Class Separation: SVM looks for a hyperplane placed between two classes with the maximum margin between the closest points of the two classes. The points that lie closest to the hyperplane, on the edge of the margin, are called support vectors.

Overlapping Classes: SVM reduces the weight of data points that fall on the wrong side of the boundary, which lets it cope with overlapping classes.

Nonlinearity: SVM can also handle classes that are not linearly separable. (Siva, 2013)

When we cannot find a linear separator, we project the data points into a higher-dimensional space in which they can be linearly separated. In many cases SVM needs little preprocessing, although scaling the features generally helps. SVM is a supervised machine learning algorithm that can be used for both classification and regression problems. Its goal is to arrive at the optimal hyperplane that separates the two classes. SVM is efficient for both linear and non-linear classification; non-linear classification is handled using the kernel trick. Choosing the kernel for non-linear classification is tricky and requires a trial-and-error approach.
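The kernel trick can be illustrated with a degree-2 polynomial kernel on 2-D points. The sketch below checks that evaluating the kernel in the original space gives the same value as taking a dot product after an explicit projection into three dimensions. The function names are chosen for the example, and this particular kernel is only one of many possible choices.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit degree-2 feature map taking a 2-D point into 3-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, y), dot(phi(x), phi(y)))
```

The two printed values agree up to floating-point rounding, which is why the projection never has to be computed explicitly.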

The SVM was also built on the pruned data, using only the variables that Boruta identified as important.

Saving the results:

The results have been saved to a scores.csv file, with the header changed to product.id and status. Once the csv file was created, it was uploaded to Kaggle.
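Writing a submission file of this shape might look like the following Python sketch. The header names are taken from the text; the helper name `write_scores` and the predictions are made up, and the example writes to an in-memory buffer so that it leaves no file behind.

```python
import csv
import io


def write_scores(predictions, out):
    """Write (id, status) pairs as CSV with the header used in the text."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["product.id", "status"])
    writer.writerows(predictions)


# Made-up predictions, written to an in-memory buffer instead of scores.csv.
buffer = io.StringIO()
write_scores([(101, "successful"), (102, "failed")], buffer)
print(buffer.getvalue())
```

To produce an actual file, the same function can be given a file opened with `open("scores.csv", "w", newline="")`.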

Conclusion:

In this assignment we tried to predict the class of a campaign using machine learning classification approaches. Several insights have been developed that can be used for better campaigning. The results suggest that better incorporation of the failed projects can significantly improve the robustness of the prediction model and lead to much better predictions. Another approach that could be used to predict campaign results is to use Twitter data and analyze the impact of the campaign on social media. For this example, Decision Trees can also be used alongside SVM, but this may take a very long time on a computer with average capabilities, and there is a high chance that the results will be unsatisfactory.

References

Ray, S. (2017) Available at: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/ (Accessed: September 13, 2017)

Siva (2013) SVM Implementation Step By Step With R. Available at: https://sivaanalytics.wordpress.com/2013/06/15/svm-implementation-step-by-step-with-r-data-preparation/ (Accessed: June 15, 2013)

Saxena, R. (2017) How Decision Tree Algorithm Works. Available at: http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/ (Accessed: January 30, 2017)

Analytics Vidhya (2016) How to perform feature selection (i.e. pick important variables) using Boruta Package in R? Available at: https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/ (Accessed: March 22, 2016)