
BUS5AP Analytics in Practice

Published: 25 Sep 2021

Question:

We assume that not everyone is familiar with the Kaggle environment or has an account on it. Therefore, the first step is to set ourselves up on Kaggle.
1. Go to Kaggle.com and create an account. There are various ways to create an account, but for the purpose of this subject, please create one that shows your name (or student ID) clearly.
a. Thinking longer term, make sure this is a professional sign-in. If it is not, create a new email address, etc., as the environment is a reflection of you professionally.

2. Kaggle has evolved to become a self-hosted environment. It runs both R and Python with many pre-built packages installed. This cloud-hosted environment is called a “Kernel” in Kaggle speak.
a. To explore the Kernel environment, click “Kernels” at the top after signing in.
b. Click “New Kernel” in the centre of the screen and you will be asked to select the data set that you want to work on.
c. Depending on the data set, you will be given a number of programming language options.
d. Have a play with the environment for now, using different data sets and what you have learned with R or SQL, for example.

3. Kaggle also hosts a number of good tutorials, which you can access here: https://www.kaggle.com/wiki/Tutorials
4. As a community site, you need to be aware of what is public and what is hidden. Kernels, data sets, competitions, etc., can be either public or private to invited individuals. Have a look around and make sure you understand your privacy settings.
5. Kaggle has an option called “Kaggle in class”, which is used primarily for teaching and learning purposes. This will be the environment used for subsequent exercises, as it gives us a more controlled setting for the classroom and for assignments.
a. If you would like to explore this more, you can do so by going to https://inclass.kaggle.com.
b. A few small data sets that make good starting points are already pre-loaded for you to explore.

Answer:

The aim of this analysis is to predict the success or failure of a Kickstarter campaign. It is a multi-class classification problem with five levels: canceled, failed, live, successful and suspended. It is extremely important to know the likely outcome of a campaign before the project is launched, so this analysis is a vital input when setting the funding goal.

Preliminary Analysis:

Each project in the dataset has one of the five statuses mentioned above. The project id and url columns do not appear to have any effect on model building, so they have been removed from both the training and test datasets. The levels column has been converted into a factor in both datasets.
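The two preparation steps above can be sketched as follows. This is only an illustration in Python using the standard library, not the R code behind the analysis; the rows and the exact column names (`project.id`, `url`, `levels`) are assumptions made for the example.

```python
# Hypothetical rows; the real dataset's column names and values may differ.
rows = [
    {"project.id": 101, "url": "https://example.com/a", "goal": 5000, "levels": "successful"},
    {"project.id": 102, "url": "https://example.com/b", "goal": 1500, "levels": "failed"},
    {"project.id": 103, "url": "https://example.com/c", "goal": 900, "levels": "live"},
]

DROP_COLUMNS = {"project.id", "url"}  # identifier-like columns that are removed


def drop_columns(records, columns):
    """Return copies of the records without the given columns."""
    return [{k: v for k, v in r.items() if k not in columns} for r in records]


def encode_factor(values):
    """Map each distinct label to an integer code, similar in spirit to an R factor.

    R factors use 1-based codes; this sketch uses 0-based codes.
    """
    categories = sorted(set(values))
    codes = {label: i for i, label in enumerate(categories)}
    return categories, [codes[v] for v in values]


cleaned = drop_columns(rows, DROP_COLUMNS)
categories, level_codes = encode_factor([r["levels"] for r in cleaned])
```

The same `drop_columns` call would be applied to the test rows so that both datasets keep identical columns.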

Exploratory Analysis:

  • The goal of the successful projects was 3 times higher than the goal of the failed projects.
  • The pledged amount of successful projects was 10 times higher than that of the failed projects.
  • Successful projects had 14 times as many comments as the failed projects.
  • The updates of successful projects were 4 times that of the failed ones.
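Comparisons like those in the list above can be computed as a ratio of group averages. The sketch below assumes the comparison is between group means (the text does not say which statistic was used) and runs on made-up toy records, not the actual dataset.

```python
from statistics import mean

# Hypothetical toy records; values are illustrative only.
projects = [
    {"status": "successful", "pledged": 12000, "comments": 28},
    {"status": "successful", "pledged": 8000, "comments": 14},
    {"status": "failed", "pledged": 900, "comments": 2},
    {"status": "failed", "pledged": 1100, "comments": 1},
]


def mean_by_status(records, field):
    """Average of one numeric field for each campaign status."""
    groups = {}
    for r in records:
        groups.setdefault(r["status"], []).append(r[field])
    return {status: mean(values) for status, values in groups.items()}


def success_to_failure_ratio(records, field):
    """How many times larger the successful mean is than the failed mean."""
    means = mean_by_status(records, field)
    return means["successful"] / means["failed"]


pledged_ratio = success_to_failure_ratio(projects, "pledged")
comments_ratio = success_to_failure_ratio(projects, "comments")
```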

Feature Selection

Feature selection is an extremely important aspect of model building, as the choice of features can alter a model's accuracy to a great extent.

Since the data has both categorical and numerical attributes, the method chosen for feature selection is “Boruta” (Analytics Vidhya, 2016).

Boruta gives a clear call on the importance of each feature. It follows an all-relevant feature selection approach, which captures every feature that is relevant to the outcome variable in at least some circumstances. Traditional feature selection algorithms, by contrast, follow a minimal-optimal approach that relies on a small subset of features yielding minimal error.

Boruta checks all the features and reports which are strongly relevant and which are weakly relevant to the outcome variable. Because of this, the technique is widely applied in the medical field.
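One way to picture how Boruta judges relevance is its comparison of each real feature against shuffled "shadow" copies of the features. The sketch below is a heavy simplification under stated assumptions: it uses absolute Pearson correlation as a stand-in importance score, whereas the Boruta package relies on random forest importance and a statistical test, and the `signal` and `noise` features are hypothetical.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_hits(features, target, n_runs=20, seed=0):
    """Count how often each real feature beats the best shuffled (shadow) copy."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_runs):
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)  # shuffling breaks any link to the target
            shadow_scores.append(abs_correlation(shadow, target))
        best_shadow = max(shadow_scores)
        for name, values in features.items():
            if abs_correlation(values, target) > best_shadow:
                hits[name] += 1
    return hits


# Hypothetical data: "signal" drives the target, "noise" does not.
data_rng = random.Random(42)
signal = [data_rng.gauss(0, 1) for _ in range(200)]
noise = [data_rng.gauss(0, 1) for _ in range(200)]
target = [s + 0.1 * data_rng.gauss(0, 1) for s in signal]

hits = shadow_hits({"signal": signal, "noise": noise}, target)
```

A feature that beats the best shadow in every run, as `signal` does here, is the kind of feature such a procedure would keep; a feature that only beats the shadows occasionally behaves like random noise.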

Results of Boruta:

According to the Boruta results, the features strongly related to the decision variable are category, subcategory, goal, levels, duration and location.

Model Building:

Since this is a classification problem, candidate methods include KNN, Random Forest, Decision Tree and SVM. The method used first is a decision tree.

A decision tree classifier is a supervised learning method that poses a series of learned questions about the features of the training examples. Each answer leads to a follow-up question, until a final conclusion about the class label is reached.
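To illustrate what a single "learned question" looks like, the sketch below picks the threshold on one numeric feature that best separates two classes, scored by weighted Gini impurity. It is a one-level illustration rather than a full tree, it is not the rpart implementation, and the goal values and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_question(values, labels):
    """Find the threshold t for 'value <= t?' with the lowest weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical funding goals and outcomes, for illustration only.
goals = [500, 1000, 2000, 8000, 15000, 30000]
status = ["failed", "failed", "failed", "successful", "successful", "successful"]

threshold, impurity = best_question(goals, status)
```

A full tree would repeat the same search inside each resulting group until the groups are pure or a stopping rule is reached.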

After creating a data frame that includes all the important features, a model is built on it using the rpart library.

The accuracy obtained is 98% on the training data and 45% on the test data. The large gap between the two suggests that the tree overfits the training data.
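For reference, accuracy is the fraction of predictions that match the true labels, and the difference between training and test accuracy is a quick check for overfitting. The sketch below uses made-up labels for the accuracy function and the two figures reported above for the gap.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def generalization_gap(train_acc, test_acc):
    """Training accuracy minus test accuracy; a large gap hints at overfitting."""
    return train_acc - test_acc


# Hypothetical labels, for illustration only.
y_true = ["failed", "successful", "failed", "live"]
y_pred = ["failed", "successful", "successful", "live"]
acc = accuracy(y_true, y_pred)

# Accuracies reported in the text: 98% on training data, 45% on test data.
gap = generalization_gap(0.98, 0.45)
```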

To improve the results, a second model, SVM, was applied.

SVM:

SVM (Support Vector Machine) is an extremely popular discriminative algorithm that is also widely used in text classification problems. SVM tries to find the maximum-margin hyperplane that separates the data by class in a high-dimensional space. Finding this hyperplane is posed as an optimization problem (Ray, 2017).

SVM helps in handling overfitting and copes comparatively well with the curse of dimensionality.

Tasks Performed By SVM:

Class Separation: SVM looks for a hyperplane placed between two classes with the maximum margin between the closest points of the two classes. The points that lie closest to the hyperplane, on the edge of the margin, are called support vectors.

Overlapping Classes: SVM reduces the weight of data points that fall on the wrong side of the boundary, which lets it cope with overlapping classes.

Nonlinearity: SVM can also handle data that is not linearly separable (Siva, 2013).
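The class-separation task described above can be made concrete by measuring the margin of a given hyperplane. The sketch below does not train an SVM; it only evaluates two hand-picked candidate hyperplanes on hypothetical 2-D points and reports the closest points, which play the role of support vectors for the wider-margin candidate.

```python
import math


def signed_distance(point, w, b):
    """Signed distance from a point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm


def margin_and_closest_points(points, labels, w, b, tol=1e-9):
    """Return the margin of a separating hyperplane and the points that attain it.

    labels are +1 or -1; the hyperplane must classify every point correctly.
    """
    distances = [y * signed_distance(p, w, b) for p, y in zip(points, labels)]
    margin = min(distances)
    if margin <= 0:
        raise ValueError("hyperplane does not separate the classes")
    closest = [p for p, d in zip(points, distances) if abs(d - margin) <= tol]
    return margin, closest


# Hypothetical points: two per class.
points = [(2, 2), (3, 3), (1, 1), (0, 0)]
labels = [+1, +1, -1, -1]

# Candidate A: x1 + x2 - 3 = 0, halfway between (1, 1) and (2, 2).
wide_margin, support = margin_and_closest_points(points, labels, (1, 1), -3)

# Candidate B: x1 - 1.5 = 0 also separates the classes, but with a smaller margin.
narrow_margin, _ = margin_and_closest_points(points, labels, (1, 0), -1.5)
```

Both candidates classify every point correctly, but candidate A leaves more room between the classes, which is the property SVM optimizes for.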

When we cannot find a linear separator, we project the data points into a higher-dimensional space in which they can be linearly separated. SVM often needs little preprocessing, although scaling numeric features is generally advisable. SVM is a supervised machine learning algorithm that can be used for both classification and regression problems. Its goal is to arrive at the optimal hyperplane separating the two classes. SVM is efficient for both linear and non-linear classification; non-linear classification is handled using the kernel trick. The choice of kernel for non-linear classification is tricky and requires a trial-and-error approach.
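The kernel trick mentioned above rests on the fact that some kernels return the same value as an inner product in a higher-dimensional space, without ever building that space. The sketch below checks this for a degree-2 polynomial kernel on 2-D inputs; the kernel choice and the sample points are assumptions for illustration, not the kernel used in the analysis.

```python
import math


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2, computed in the original space."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))  # inner product after projecting to 3-D
implicit = poly_kernel(x, y)    # same value without ever building phi
```

The two values agree up to floating-point rounding, which is why an SVM can work in the projected space while only evaluating the kernel on the original points.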

The SVM has also been built on the pruned data, using only the variables that Boruta identified as important.

Saving the results:

The results have been saved to the scores.csv file, with the header changed to product.id and status. After the CSV file was created, it was uploaded to Kaggle.
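A minimal sketch of writing such a file is shown below, using Python's standard csv module. The header names come from the text; the prediction rows are hypothetical, and the example writes to an in-memory buffer so that it runs without touching the file system.

```python
import csv
import io


def write_scores(rows, handle):
    """Write (id, predicted status) pairs as CSV with the header named in the text."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["product.id", "status"])
    writer.writerows(rows)


# Hypothetical predictions, for illustration only.
predictions = [(101, "successful"), (102, "failed")]

buffer = io.StringIO()
write_scores(predictions, buffer)
csv_text = buffer.getvalue()

# To write the actual file instead of an in-memory buffer:
# with open("scores.csv", "w", newline="") as f:
#     write_scores(predictions, f)
```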

Conclusion:

In this assignment we try to predict the class of a campaign using machine learning classification approaches. Several insights have been developed that can be used for better campaigning. The results suggest that better incorporation of the failed projects can significantly improve the robustness of the prediction model and lead to much better predictions. Another approach that could be used to predict campaign results is to use Twitter data and analyze the impact of the campaign on social media. For this problem, decision trees can be used along with SVM. However, it might take a very long time on a computer with average capabilities, and chances are high that the results will be unsatisfactory.

References

Ray, S. (2017) Available at: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/ (Accessed: September 13, 2017).

Siva (2013) SVM Implementation Step By Step With R. Available at: https://sivaanalytics.wordpress.com/2013/06/15/svm-implementation-step-by-step-with-r-data-preparation/ (Accessed: June 15, 2013).

Saxena, R. (2017) How Decision Tree Algorithm Works. Available at: http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/ (Accessed: January 30, 2017).

Analytics Vidhya (2016) How to perform feature selection (i.e. pick important variables) using Boruta Package in R? Available at: https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/ (Accessed: March 22, 2016).
