JOSEPH TRUE

Text Mining with Python and the Natural Language Toolkit

For this project, I used Python and the Natural Language Toolkit (NLTK) to text mine 129,000 documents containing abstracts of National Science Foundation (NSF) awards for basic research.
Graduate course: CS 548 - Knowledge Discovery and Data Mining
Date: April 2015
Tools:
  • Python
  • Natural Language Toolkit (http://www.nltk.org/)
  • Weka (http://www.cs.waikato.ac.nz/ml/weka/)
  • Google Ngram Viewer tool (https://books.google.com/ngrams)
  • Wordle - word cloud tool (http://www.wordle.net)

Data:
National Science Foundation (NSF) Research Award Abstracts 1990-2003 Data Set (http://archive.ics.uci.edu/ml/datasets/NSF+Research+Award+Abstracts+1990-2003)

Process

I organized the files into sub-folders, each containing about 10,000 documents.
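The batching step can be sketched as follows. The original file layout isn't shown, so the flat source folder, destination folder, `.txt` extension, and `batch_NN` naming below are all assumptions:

```python
import shutil
from pathlib import Path

BATCH_SIZE = 10_000  # about 10,000 documents per sub-folder

def organize(source: Path, dest: Path, batch_size: int = BATCH_SIZE) -> None:
    """Move a flat folder of abstract files into numbered sub-folders."""
    files = sorted(source.glob("*.txt"))  # assumed file extension
    for i, f in enumerate(files):
        batch_dir = dest / f"batch_{i // batch_size:02d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(f), str(batch_dir / f.name))
```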
I used Python to iterate through each sub-folder, from 1990 to 2003, and count the occurrences of specific research words.
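The per-year counting loop might look like this minimal sketch. A plain regex tokenizer stands in for NLTK's `word_tokenize` (which needs a model download), and the year-named sub-folders and `.txt` extension are assumptions:

```python
import re
from collections import Counter
from pathlib import Path

def count_terms_by_year(root: Path, terms: set) -> dict:
    """Tally how often each chosen research word appears, per year folder."""
    counts = {}
    for year in range(1990, 2004):      # award years 1990 - 2003
        year_dir = root / str(year)     # assumed one sub-folder per year
        tally = Counter()
        if year_dir.is_dir():
            for doc in year_dir.glob("*.txt"):
                text = doc.read_text(errors="ignore").lower()
                # NLTK's word_tokenize could replace this simple tokenizer
                tokens = re.findall(r"[a-z]+", text)
                tally.update(t for t in tokens if t in terms)
        counts[year] = tally
    return counts
```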
For example, I counted the occurrence of the word "hubble" for each of the years and output the results in a simple list.
Then I plotted those words across the years for comparison.
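A comparison plot like this can be produced with matplotlib. Matplotlib isn't listed in the tools above, so this is a stand-in for whatever charting route was actually used, and the expected input is the year-to-counts mapping from the counting step, not the project's real numbers:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def plot_term_trends(counts_by_year, terms, outfile="term_trends.png"):
    """Plot one line per term across the award years.

    counts_by_year: {year: {term: count}} mapping (assumed shape).
    """
    years = sorted(counts_by_year)
    for term in terms:
        plt.plot(years, [counts_by_year[y].get(term, 0) for y in years],
                 label=term)
    plt.xlabel("Year")
    plt.ylabel("Occurrences in NSF abstracts")
    plt.legend()
    plt.savefig(outfile)
    plt.close()
```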
For some words, I also looked them up in Google's Ngram Viewer to see how often the word appeared in general usage during the same period. Here you can see the occurrence of "internet" in the NSF research documents compared with its occurrence as shown by Google Ngram Viewer.
Here's the Google Ngram Viewer comparison for "internet", "mobile computing", and "hubble space".
