ITC516 Data Mining and Visualisation for Business Intelligence

Published: 17-Sep-2021

Question:

The dataset EastWestAirlinesCluster.xls contains information on 3999 passengers who belong to an airline’s frequent flier program. For each passenger the data include information on their mileage history and on different ways they accrued or spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics for the purpose of targeting different segments for different types of mileage offers.
a) Apply hierarchical clustering with Euclidean distance and Ward's method. Make sure to normalize the data first. How many clusters appear?
b) What would happen if the data were not normalized?
c) Compare the cluster centroids to characterize the different clusters, and try to give each cluster a label.
d) Use K-means clustering with the number of clusters that you found above. Does the same picture emerge?
e) Which clusters would you target for offers, and what types of offers would you target to customers in that cluster?

Answer:

XLMiner is used to apply association rules to the data variables. The inputs are as follows:

Number of transaction = 500

Number of variables = 7

Minimum confidence = 0.5 or 50%

XLMiner has generated the output, and part of the list of rules is as shown below:

After analyzing the list of rules generated by XLMiner, the following conclusions have been derived regarding the first three association rules (Ana, 2014).

Rule 1 - As per the first association rule, a customer who purchases a brush also purchases nail polish, with an estimated confidence of 100%.

Rule 2 - As per the second association rule, a customer who purchases nail polish also purchases a brush, with an estimated confidence of 63.22%.

Rule 3 - As per the third association rule, a customer who purchases nail polish also purchases bronzer, with an estimated confidence of 59.20%.

Redundancy is a potential issue with association rules, so redundant rules need to be trimmed. A particular case of redundancy in the given data is rule 2, which has exactly the same support and lift ratio as rule 1. The only difference is that the confidence of rule 2 is lower than that of rule 1, which makes rule 2 inferior to rule 1 (Zaki, 2000).
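The relationship between rule 1 and rule 2 can be illustrated with a minimal sketch. The transactions below are hypothetical toy data, not the XLMiner dataset, and the function names are chosen only for illustration. The sketch shows why a rule and its reverse always share support and lift while their confidence can differ.

```python
# Hypothetical toy transactions (NOT the data analysed in XLMiner).
transactions = [
    {"brush", "nail polish"},
    {"brush", "nail polish", "bronzer"},
    {"nail polish", "bronzer"},
    {"nail polish"},
    {"bronzer"},
    {"lipstick"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(antecedent | consequent)
    confidence = both / support(antecedent)
    lift = confidence / support(consequent)
    return both, confidence, lift


forward = rule_metrics({"brush"}, {"nail polish"})   # brush -> nail polish
reverse = rule_metrics({"nail polish"}, {"brush"})   # nail polish -> brush

for name, (sup, conf, lift) in (("forward", forward), ("reverse", reverse)):
    print(f"{name}: support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

Support and lift are symmetric in the two item sets, so only confidence separates the two directions; that is why the lower-confidence direction can be treated as the redundant one.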

Additionally, the utility of association rules lies in their ability to reveal hidden associations in consumer buying behavior. To use them effectively, however, a balance between support and confidence is required. If the minimum support is set high, rules involving rare items are not displayed. If it is set low, so many rules are generated that their practical utility is undermined, which is why a very low threshold is not recommended (Liebowitz, 2015).

In this case, the inputs are the same except that the minimum confidence has been changed from 0.50 to 0.75.

Minimum confidence = 0.75 or 75%

XLMiner has generated the output, and part of the list of rules is as shown below.

It would be fair to conclude that when the minimum confidence is increased from 0.5 to 0.75, the number of association rules displayed decreases. This is because only the rules whose confidence is at or above the selected minimum appear; XLMiner automatically removes the rules with a lower confidence. This could be problematic, since the rules not displayed may have high support levels and hence be significant (Ragsdale, 2014).
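The effect of raising the threshold can be sketched in a few lines. This is an illustrative example on hypothetical toy transactions (not the XLMiner data), restricted to single-item rules; `rules_above` is a made-up helper name.

```python
from itertools import permutations

# Hypothetical toy transactions (NOT the data analysed in XLMiner).
transactions = [
    {"brush", "nail polish"},
    {"brush", "nail polish", "bronzer"},
    {"nail polish", "bronzer"},
    {"nail polish"},
    {"bronzer"},
    {"lipstick"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rules_above(min_confidence):
    """List (antecedent, consequent, support, confidence) for single-item rules."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        n_ab = count({a, b})
        if n_ab == 0:
            continue  # the two items never occur together
        confidence = n_ab / count({a})
        if confidence >= min_confidence:
            rules.append((a, b, n_ab / len(transactions), confidence))
    return rules


kept_50 = rules_above(0.50)
kept_75 = rules_above(0.75)
dropped = [r for r in kept_50 if r not in kept_75]

print(len(kept_50), "rules at 0.50;", len(kept_75), "rules at 0.75")
for a, b, sup, conf in dropped:
    print(f"dropped: {a} -> {b} (support={sup:.2f}, confidence={conf:.2f})")
```

In this toy data some of the dropped rules have the same support as the rule that survives, which mirrors the concern that a confidence filter alone can hide well-supported rules.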

To determine the number of clusters formed from the data, a dendrogram needs to be prepared in XLMiner.

The above output was generated using Ward's method, with three clusters chosen. The output confirms this choice: if a horizontal line is drawn at a distance of about 990, it intersects the dendrogram at three points, indicating that three clusters are obtained.
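To make the idea behind Ward's method concrete, here is a minimal, unoptimized sketch in plain Python. It is not the XLMiner implementation: the points are hypothetical, already-normalized toy data, and stopping when three clusters remain plays the role of cutting the dendrogram with a horizontal line. At each step the sketch merges the pair of clusters whose union gives the smallest increase in within-cluster sum of squares.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clusters(points, k):
    """Agglomerate singleton clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Pick the cheapest merge according to Ward's criterion.
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical, already-normalized toy points forming three visible groups.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.3),
    (10.0, 0.2), (10.1, 0.0), (9.8, 0.1),
]

clusters = ward_clusters(points, k=3)
for c in clusters:
    print(sorted(c))
```

The pairwise search is recomputed at every step, which is fine for a handful of points but far too slow for 3999 passengers; a dedicated clustering library would be the practical choice there.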

If the data are not normalized before the clustering analysis is conducted, the following issues can arise (Abramowics, 2013).
It reduces the overall accuracy of the result.
It distorts the distances between the centroids, which adversely affects the use of these results.
Lack of normalization can also stretch otherwise spherical clusters into elliptical ones, which creates problems for the clustering algorithm.
Scale differences also create problems, especially when one variable has a much larger magnitude than the others and therefore dominates the distance calculation.
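The scale problem described above can be demonstrated with a small sketch. The passenger records are hypothetical (a miles balance and a count of flight transactions), and z-scores with the population standard deviation are assumed as the normalization; the source does not state which variant XLMiner applies.

```python
import math
from statistics import mean, pstdev

# Hypothetical passengers: (miles balance, flight transactions in the last year).
passengers = [
    (28000.0, 1.0),   # A
    (30000.0, 12.0),  # B: similar balance to A, very different flight activity
    (36000.0, 1.0),   # C: same flight activity as A, somewhat higher balance
    (90000.0, 2.0),
    (5000.0, 0.0),
]


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def zscore(rows):
    """Standardize each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(r, mus, sds)) for r in rows]


normalized = zscore(passengers)

raw_ab = euclidean(passengers[0], passengers[1])
raw_ac = euclidean(passengers[0], passengers[2])
norm_ab = euclidean(normalized[0], normalized[1])
norm_ac = euclidean(normalized[0], normalized[2])

print(f"raw:        d(A,B)={raw_ab:.1f}  d(A,C)={raw_ac:.1f}")
print(f"normalized: d(A,B)={norm_ab:.3f}  d(A,C)={norm_ac:.3f}")
```

On the raw values the miles balance dominates, so A looks closest to B even though their flight activity differs widely. After normalization both variables contribute on a comparable scale and A is closest to C.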
The clusters should be labeled on the basis of the common characteristics they display in the hierarchical clustering, as highlighted below for clusters 1, 2 and 3 in turn (Ana, 2014).

Key Observations:

Non-flight bonus transactions are quite minimal (lowest of all clusters)
Flight transactions are quite minimal in the past one year
Balance in terms of miles eligible for award travel is the lowest of all clusters

Appropriate Label: “Middle Class Flyers”

Key Observations:

Non-flight bonus transactions are quite substantial (second only to cluster 3)
Flight transactions are the highest in the past one year
Balance in terms of miles eligible for award travel is the highest of all clusters
Also, some have a high number of miles qualifying for Top Flight status

Appropriate Label: “High Networth Flyers”

Key Observations:

Non-flight bonus transactions are highest amongst all clusters.
However, flight transactions are comparatively very low in the past one year
Balance in terms of miles eligible for award travel is significant, though lower than in cluster 2.

Appropriate Label: “Non-frequent Flyers”

The XLMiner output for K-Means Clustering is shown below.

There is parity in terms of the number of clusters formed. However, close scrutiny of the cluster characteristics reveals an underlying difference. One of the K-means clusters displays the following characteristics:

Highest miles eligible for award travel
Highest frequency of flight transactions in the year gone by
Highest miles counted for Top Flight status

In line with the above observations, it would be fair to label this cluster “High Networth Flyers”. However, comparison with the hierarchical clustering output clearly indicates a difference, since in that case it was cluster 2 that comprised these flyers. Thus, it would be fair to conclude that the output of K-Means Clustering differs from that of hierarchical clustering (Grossmann & Rinderle-Ma, 2015).
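When comparing the two outputs, it helps to keep in mind that K-means cluster numbers depend on the starting centroids. The sketch below is a plain-Python version of the standard assign-and-update loop, run on hypothetical, already-normalized toy points with hand-picked starting centroids; it is not the XLMiner procedure. Two different starting orders yield the same groups under different cluster numbers, so clusters are better matched by their centroid profiles than by index.

```python
def kmeans(points, init_centroids, max_iter=100):
    """Basic K-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in init_centroids]
    groups = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            dists = [sum((x - y) ** 2 for x, y in zip(p, c)) for c in centroids]
            groups[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups


# Hypothetical, already-normalized toy points forming three visible groups.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.3),
    (10.0, 0.2), (10.1, 0.0), (9.8, 0.1),
]

# Same starting centroids, listed in a different order.
cent_a, groups_a = kmeans(points, [points[0], points[3], points[6]])
cent_b, groups_b = kmeans(points, [points[6], points[0], points[3]])

for idx, (ca, cb) in enumerate(zip(cent_a, cent_b), start=1):
    print(f"cluster {idx}: run A centroid={ca}  run B centroid={cb}")
```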

e) The clusters chosen for targeting, owing to their current contribution and future potential, are clusters 2 and 3 from the hierarchical clustering. The offer for cluster 2 would involve incentives for greater use of the frequent flyer card issued by the airline, leading to more reward points. Also, the bonus miles provided could be linked to a threshold number of annual check-ins.

For cluster 3, an incentive in the form of higher reward points needs to be offered so that the customer uses the frequent flyer card and the bonus points are utilized for flight transactions.

References

Abramowics, W. (2013) Business Information Systems Workshops: BIS 2013 International Workshops (5th ed.). New York: Springer.

Ana, A. (2014) Integration of Data Mining in Business Intelligence System (4th ed.). Sydney: IGA Global.

Grossmann, W. & Rinderle-Ma, S. (2015) Fundamentals of Business Intelligence (2nd ed.). New York: Springer.

Liebowitz, J. (2015) Business Analytics: An Introduction (2nd ed.). New York: CRC Press.

Ragsdale, C. (2014) Spreadsheet Modelling and Decision Analysis: A Practical Introduction to Business Analytics (7th ed.). London: Cengage Learning.

Zaki, M.J. (2000) Generating non-redundant association rules. In: Proceedings of the ACM SIGKDD, pp. 34–43.