research-article

The WEKA data mining software: an update

Published: 16 November 2009

Abstract

More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (WEKA 3.4) released in 2003.

References

  1. I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, and S. Mock. Kepler: An extensible system for design and execution of scientific workflows. In SSDBM, pages 21--23, 2004.
  2. K. Bennett and M. Embrechts. An optimization perspective on kernel partial least squares regression. In J.S. et al., editor, Advances in Learning Theory: Methods, Models and Applications, volume 190 of NATO Science Series, Series III: Computer and System Sciences, pages 227--249. IOS Press, Amsterdam, The Netherlands, 2003.
  3. L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, California, 1984.
  4. S. Celis and D.R. Musicant. Weka-parallel: machine learning in parallel. Technical report, Carleton College, CS TR, 2002.
  5. C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  6. T.G. Dietterich, R.H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell., 89(1-2):31--71, 1997.
  7. J. Dietzsch, N. Gehlenborg, and K. Nieselt. Mayday - a microarray data analysis workbench. Bioinformatics, 22(8):1010--1012, 2006.
  8. L. Dong, E. Frank, and S. Kramer. Ensembles of balanced nested dichotomies for multi-class problems. In Proc 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, pages 84--95. Springer, 2005.
  9. R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871--1874, 2008.
  10. E. Frank and S. Kramer. Ensembles of nested dichotomies for multi-class problems. In Proc 21st International Conference on Machine Learning, Banff, Canada, pages 305--312. ACM Press, 2004.
  11. R. Gaizauskas, H. Cunningham, Y. Wilks, P. Rodgers, and K. Humphreys. GATE: an environment to support research and development in natural language engineering. In Proceedings of the 8th IEEE International Conference on Tools with Artificial Intelligence, pages 58--66, 1996.
  12. J. Gama. Functional trees. Machine Learning, 55(3):219--250, 2004.
  13. A. Genkin, D.D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technical report, DIMACS, 2004.
  14. J.E. Gewehr, M. Szugat, and R. Zimmer. BioWeka - extending the Weka framework for bioinformatics. Bioinformatics, 23(5):651--653, 2007.
  15. M. Hall and E. Frank. Combining naive Bayes and decision tables. In Proc 21st Florida Artificial Intelligence Research Society Conference, Miami, Florida. AAAI Press, 2008.
  16. K. Hornik, A. Zeileis, T. Hothorn, and C. Buchta. RWeka: An R Interface to Weka, 2009. R package version 0.3-16.
  17. L. Jiang and H. Zhang. Weightily averaged one-dependence estimators. In Proceedings of the 9th Biennial Pacific Rim International Conference on Artificial Intelligence, PRICAI 2006, volume 4099 of LNAI, pages 970--974, 2006.
  18. R. Khoussainov, X. Zuo, and N. Kushmerick. Grid-enabled Weka: A toolkit for machine learning on the grid. ERCIM News, 59, 2004.
  19. M.-A. Krogel and S. Wrobel. Facets of aggregation approaches to propositionalization. In T. Horvath and A. Yamamoto, editors, Work-in-Progress Track at the Thirteenth International Conference on Inductive Logic Programming (ILP), 2003.
  20. I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler. Yale: Rapid prototyping for complex data mining tasks. In L. Ungar, M. Craven, D. Gunopulos, and T. Eliassi-Rad, editors, KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935--940, New York, NY, USA, August 2006. ACM.
  21. D. Nadeau. Balie - baseline information extraction: Multilingual information extraction from text with machine learning and natural language techniques. Technical report, University of Ottawa, 2005.
  22. G. Piatetsky-Shapiro. KDnuggets news on SIGKDD service award. http://www.kdnuggets.com/news/2005/n13/2i.html, 2005.
  23. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2006. ISBN 3-900051-07-0.
  24. J.J. Rodriguez, L.I. Kuncheva, and C.J. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619--1630, 2006.
  25. K. Sandberg. The Haar wavelet transform. http://amath.colorado.edu/courses/5720/2000Spr/Labs/Haar/haar.html, 2000.
  26. M. Seeger. Gaussian processes for machine learning. International Journal of Neural Systems, 14:2004, 2004.
  27. C. Shearer. The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5(4), 2000.
  28. H. Shi. Best-first decision tree learning. Master's thesis, University of Waikato, Hamilton, NZ, 2007. COMP594.
  29. N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 129--136, 2002.
  30. J. Su, H. Zhang, C.X. Ling, and S. Matwin. Discriminative parameter learning for Bayesian networks. In ICML 2008, 2008.
  31. D. Talia, P. Trunfio, and O. Verta. Weka4WS: a WSRF-enabled Weka toolkit for distributed data mining on grids. In Proc. of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), pages 309--320. Springer-Verlag, 2005.
  32. K.M. Ting and I.H. Witten. Stacking bagged and dagged models. In D. H. Fisher, editor, Fourteenth International Conference on Machine Learning, pages 367--375, San Francisco, CA, 1997. Morgan Kaufmann Publishers.
  33. J.S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37--57, 1985.
  34. I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.
  35. I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
  36. I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin, and C.G. Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Y.-L. Theng and S. Foo, editors, Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pages 129--152. Information Science Publishing, London, 2005.
  37. X. Xu. Statistical learning in multiple instance problems. Master's thesis, Department of Computer Science, University of Waikato, 2003.
  38. Y. Yang, X. Guan, and J. You. CLOPE: a fast and effective clustering algorithm for transactional data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 682--687. ACM New York, NY, USA, 2002.
  39. F. Zheng and G.I. Webb. Efficient lazy elimination for averaged one-dependence estimators. In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006), pages 1113--1120. ACM Press, 2006.

Published in

ACM SIGKDD Explorations Newsletter, Volume 11, Issue 1
June 2009
56 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1656274

Copyright © 2009 Authors

Publisher

Association for Computing Machinery
New York, NY, United States