The issue of string vs. double vs. categorical (nominal, in Weka's terms) values is (has to be!) solved by Weka already. How can we leverage their solution to make our code simpler and cleaner?
Our code needs to represent statistical distributions of variables in addition to plain values, but I think that is about the only difference.
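
To make the question concrete, here is a rough sketch of the shape I have in mind. As far as I understand it, Weka stores every cell as a double and lets the attribute decide what that double means (the number itself, or an index into a list of labels). The sketch below mimics that idea in plain Java and adds a cell type for distributions. All names here (`AttrKind`, `Attr`, `Cell`, `PlainCell`, `GaussianCell`) are hypothetical and are not Weka classes; in practice `Attr` would presumably be replaced by Weka's own attribute type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these types are part of Weka.
enum AttrKind { NUMERIC, NOMINAL, STRING }

final class Attr {
    final String name;
    final AttrKind kind;
    // Labels for NOMINAL (fixed up front) or STRING (grows as values are seen).
    private final List<String> labels = new ArrayList<>();

    Attr(String name, AttrKind kind, List<String> nominalLabels) {
        this.name = name;
        this.kind = kind;
        if (nominalLabels != null) {
            labels.addAll(nominalLabels);
        }
    }

    // Encode raw text as a double: the number itself, or the label's index.
    double encode(String raw) {
        if (kind == AttrKind.NUMERIC) {
            return Double.parseDouble(raw);
        }
        int idx = labels.indexOf(raw);
        if (idx < 0) {
            if (kind == AttrKind.NOMINAL) {
                throw new IllegalArgumentException("Unknown label: " + raw);
            }
            labels.add(raw); // STRING attributes accept new values
            idx = labels.size() - 1;
        }
        return idx;
    }

    // Turn an encoded double back into its textual form.
    String decode(double value) {
        return kind == AttrKind.NUMERIC
                ? Double.toString(value)
                : labels.get((int) value);
    }
}

// A cell is either a plain encoded value or a distribution over values.
interface Cell {
    double pointEstimate();
}

final class PlainCell implements Cell {
    private final double value;

    PlainCell(double value) { this.value = value; }

    public double pointEstimate() { return value; }
}

// Assumption: a normal distribution is just one example of what we need.
final class GaussianCell implements Cell {
    final double mean;
    final double stdDev;

    GaussianCell(double mean, double stdDev) {
        this.mean = mean;
        this.stdDev = stdDev;
    }

    public double pointEstimate() { return mean; }
}
```

If this is roughly right, the type handling lives entirely in the attribute, and the distribution support stays a thin layer on top of it, which would be the only part we have to maintain ourselves.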