Breakthroughs in artificial intelligence often result from a felicitous combination of new ideas with a surprising mixture of new and old technology. Machine learning is popularly conflated with artificial intelligence, but this is erroneous: machine learning, and deep learning within it, are subsets of the broader field of AI. In this article we will explore a popular ML method called K-Means Clustering and look at the applications and limitations of such technology in use today. This will give us a basis for understanding how ML applies to AI research.

Methods of machine learning can be further subdivided into two areas called supervised and unsupervised learning. Supervised learning methods include linear regression, which is mathematically similar to traditional actuarial and other long-established statistical forecasting methods, and support vector machines (SVMs). Supervised models are trained by feeding them a dataset of known accurate values. The equations of the model are built from the dataset, with parameters fitted so that the features map onto the target values. Usually, a portion of the dataset is reserved to test the accuracy of the model after training.

A clear example of such a regression model is one that forecasts house prices from recent real estate sales data. Such a dataset may contain the actual sale prices of thousands of recent sales along with hundreds of descriptive features. The features are the attributes that influence the price, such as number of bedrooms, square footage, and neighborhood or zip code, which may indicate a level of affluence or a preferred school district. These features and target prices are used to construct the equations that become the model for predicting future house prices. What makes this a supervised training method is that we have a set of known target values, in this case actual sale prices, which the model learns to match from the features with varying precision.
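The hold-out idea and the house-price example can be sketched together in plain Python. This is a minimal illustration, not a realistic pricing model: the sales figures are made up, only one feature (square footage) is used, and the names `sales`, `fit_line`, and `mean_abs_error` are hypothetical.

```python
# Made-up sales records for illustration: (square footage, sale price in dollars).
sales = [
    (850, 172_000), (1000, 198_000), (1200, 243_000), (1500, 296_000),
    (1800, 365_000), (2000, 398_000), (2400, 481_000), (3000, 597_000),
]

# Reserve the last two sales for testing and train on the rest.
train, test = sales[:6], sales[6:]


def fit_line(rows):
    """Least-squares fit of price = slope * sqft + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sum_xx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sum_xy / sum_xx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)

# Compare predictions with the known prices the model never saw in training.
errors = [abs((slope * sqft + intercept) - price) for sqft, price in test]
mean_abs_error = sum(errors) / len(errors)

print(f"price ~ {slope:.1f} * sqft + {intercept:.0f}")
print(f"mean absolute error on held-out sales: {mean_abs_error:.0f}")
```

In practice the held-out portion is usually chosen at random rather than taken from the end of the list; the fixed split here only keeps the sketch short and repeatable.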

By contrast, unsupervised learning methods require more diversity in method and strategy because target values are not known in advance. In many cases we don’t even know what we’re looking for in the data! Quite naturally, many of the circumstances we wish to forecast do not come with neat equations and a set of known previous outcomes to train on. Unsupervised methods aim to recognize patterns in data without labels that tell them what to look for. When a model of this kind works, it can produce results which are surprisingly salient, and this gives us a hint of emerging intelligence.

Our example here is one of these unsupervised methods, called K-Means Clustering. To quickly demystify the technical jargon, imagine a grassy field where horses of many colors are grazing. Three shepherds also roam about the field, each wearing a kaftan of a unique color. As the shepherd wearing the brown kaftan wanders along, the brown horses tend to cluster around him. White horses gravitate toward the shepherd with the white kaftan, and so on. After a time, all the horses have chosen a shepherd, even the khaki-colored horse, whose coat falls somewhere between brown and white. In the language of K-Means, the shepherds are centroids, K is the number of shepherds (here three), and the color of the horses is the single feature of our simple dataset, which we group by clustering the horses around a centroid. The "means" in the name refers to each centroid settling at the mean of the points assigned to it.
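The shepherd picture can be sketched in a few lines of plain Python. This is a minimal illustration of the basic assign-then-update loop, not the scikit-learn implementation. Each horse's coat is reduced to a single made-up brightness number between 0 (black) and 1 (white), the third shepherd is assumed to wear black, and the names `assign`, `k_means_1d`, `horses`, and `shepherds` are hypothetical.

```python
def assign(values, centroids):
    """Assignment step: put each value in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for value in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
        clusters[nearest].append(value)
    return clusters


def k_means_1d(values, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        clusters = assign(values, centroids)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster leaves its centroid where it was).
        updated = [
            sum(cluster) / len(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)


# Made-up coat brightness per horse; 0.65 is the khaki one.
horses = [0.05, 0.10, 0.08, 0.38, 0.42, 0.45, 0.65, 0.92, 0.95, 0.98]
# Starting positions of the black, brown, and white shepherds.
shepherds = [0.0, 0.5, 1.0]

centroids, clusters = k_means_1d(horses, shepherds)
for centroid, cluster in zip(centroids, clusters):
    print(f"centroid {centroid:.3f}: {cluster}")
```

With these numbers the khaki horse ends up in the brown cluster, because 0.65 stays closer to the brown centroid than to the white one. Different starting positions for the shepherds can produce different groupings, which is one of the known limitations of K-Means.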

In practice it is possible to run K-Means on data about which we know very little in advance, apart from choosing the number of clusters. In fact, K-Means is well suited to datasets which have no labels. Of course, the actual mechanics of K-Means Clustering are a substantial matter. Yet we can get a clear concept of how it works by running a real-world model ourselves, using Python and the scikit-learn (sklearn) library of machine learning functions. In our practical example, we will use K-Means Clustering to classify news articles by subject.

Classifying News Articles

In this practical example we will use the Reddit API to download a sample of live news articles and feed them to the K-Means Clustering algorithm provided in the scikit-learn (sklearn) library. In summary, we will:

  • Use an API to download news article text
  • Read text from the news website
  • Execute K-Means Cluster analysis
  • Classify news articles by topic
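As a preview of the last two steps, here is a minimal sketch of clustering text with scikit-learn. It assumes scikit-learn is installed, and it uses a handful of hand-written stand-in headlines rather than real Reddit data. Converting the text to TF-IDF vectors is an assumption of this sketch (K-Means needs numeric features), not a step stated above, and the variable names are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hand-written stand-in headlines (not real Reddit data).
headlines = [
    "Team wins championship game in overtime",
    "Star player scores twice in championship game",
    "Coach praises team after game win",
    "Stock market falls as interest rates rise",
    "Central bank raises interest rates again",
    "Investors watch stock market and interest rates",
]

# Turn each headline into a vector of TF-IDF word weights.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(headlines)

# Ask for two clusters; random_state seeds the centroid initialization
# so that repeated runs give the same result.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(features)

for label, headline in zip(labels, headlines):
    print(label, headline)
```

The cluster numbers themselves are arbitrary. K-Means only groups similar articles together, and it is still up to us to read each group and decide which topic it represents.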

To follow our example here, you will need to get a developer key from the Reddit website, which will serve as the password for your app to download news text from the site. Open your favorite Python editor, or a command line in the folder for a new project, and enter this code to get started:

Client_ID = "your Reddit ID goes here"
Client_Secret = "your Reddit user secret goes here"
User_Agent = "your Reddit username goes here"
Username = "your Reddit username"
Password = "your Reddit password"

Next, we define a login function that returns the user token, which will be used in future calls to the API:

import requests

def login(username, password):
    # The app credentials go in HTTP basic auth; the user credentials go in the form data.
    headers = {"User-Agent": User_Agent}
    client_auth = requests.auth.HTTPBasicAuth(Client_ID, Client_Secret)
    post_data = {"grant_type": "password", "username": username, "password": password}
    response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
    # Return the parsed JSON body of the response.
    return response.json()
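Because the login function returns whatever JSON the server sends back, a failed login also produces a dictionary. A minimal sketch of a check is shown here, assuming that a failure carries an `error` key and a success carries an `access_token` key; both assumptions should be verified against Reddit's API documentation. The helper name `extract_access_token` and the sample payloads are hypothetical.

```python
def extract_access_token(payload):
    """Return the access token from a token response, or raise if it is missing.

    Assumption: a failed request yields a dictionary with an "error" key and a
    successful one carries "access_token".
    """
    if "error" in payload:
        raise ValueError(f"Login failed: {payload['error']}")
    if "access_token" not in payload:
        raise ValueError("Response did not contain an access token")
    return payload["access_token"]


# Hand-written stand-ins for responses (not actual Reddit output).
ok_payload = {"access_token": "example-token", "token_type": "bearer"}
bad_payload = {"error": "invalid_grant"}

print(extract_access_token(ok_payload))
```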

Call the login function to obtain the user token:

token = login(Username, Password)

The API responds with a token like this example:

Next we get the headers, and create a Python dictionary with: