Clustering is generally done using two methods, and the choice of the number of clusters is very different between them given their different use cases:
- Hierarchical clustering: generally used when the number of observations is small (typically fewer than 1,000). Choosing the number of clusters is easy here: the output is a dendrogram showing which observations group together well, and given a threshold on the maximum distance between groups, the number of clusters follows automatically.
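A minimal sketch of this dendrogram-cutting workflow using SciPy; the toy data and the distance threshold of 10 are hypothetical choices for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of points (hypothetical example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(6, 1, (20, 2))])

# Build the hierarchy (Ward linkage minimizes within-cluster variance);
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree for inspection
Z = linkage(X, method="ward")

# Cutting the tree at a maximum inter-group distance yields the
# cluster labels directly, without fixing k in advance
labels = fcluster(Z, t=10, criterion="distance")
print(len(set(labels)))  # number of clusters implied by the threshold
```

The key point is that k is a by-product of the distance threshold, not an input.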
- K-means clustering: the more widely used technique, as it scales to much larger volumes of data. The optimal choice of k strikes a balance between maximum compression of the data (a single cluster) and maximum accuracy (each data point in its own cluster). As a rule of thumb you can set k to the square root of (number of observations / 2). Otherwise, k can be chosen with several statistical methods:
a. Find the elbow in a plot of explained variance versus number of clusters: pick the number of clusters beyond which each additional cluster improves the explained variance much less than previous increments did.
b. Plot the Cubic Clustering Criterion (CCC) against k and choose the k at which it peaks.
c. Other methods include information criteria (BIC/AIC) and cross-validation, though these are rarely used in practice.
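A sketch of the elbow heuristic (method a) using scikit-learn. The synthetic three-blob data is a hypothetical example, and picking the largest relative drop in inertia is one simple way to locate the elbow programmatically; in practice you would usually eyeball the plot:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with 3 well-separated blobs (hypothetical example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 5, 10)])

# Within-cluster sum of squares (inertia) for k = 1..8;
# lower inertia means tighter clusters, and the "elbow" is
# where adding clusters stops paying off
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)}

# Relative improvement when going from k to k+1 clusters
drops = {k: 1 - inertias[k + 1] / inertias[k] for k in range(1, 8)}

# Heuristic: the elbow sits just after the largest relative drop
elbow = max(drops, key=drops.get) + 1
print(elbow)
```

Plotting `inertias` against k gives the variance-vs-clusters curve described above; the same loop structure also works for the CCC or BIC criteria if you swap in the corresponding score.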
Hope this helps,