DEV Community

Discussion on: What's the best way to get started with machine learning?

edA‑qa mort‑ora‑y

Classifying documents or websites can be an interesting way to start. Take a list of domains that you like on a topic, and a bunch of unrelated domains. Now try to build a system that will pick out other domains you would like. This will introduce several of the concepts behind machine learning:

  • attribute extraction: picking data out of the websites for use as attributes (this can be as simple as parsing the HTML and pulling out words)
  • training: using known good data (a subset of the domains you like, along with some of the unrelated ones) to train the system
  • classification: letting the system run on new data. You have only two categories at this point, so it should be easy to confirm the results manually.

I'm specifically not including any technologies or programming languages here. All languages have numerous libraries, and searching for the concept words above, or just for "machine learning", should turn up many. But the above gives the basic outline of what you're trying to do, to help guide the search.
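
Purely as an illustration of that outline, and not a recommendation of any particular language or library, here is a minimal Python sketch using only the standard library. It assumes a word-count naive Bayes model, which is just one of many possible choices. The function names (`extract_words`, `train`, `classify`), the labels, and the tiny inline HTML snippets are all made up for the example; a real attempt would fetch pages from the actual domains.

```python
import math
import re
from collections import Counter


def extract_words(html):
    """Attribute extraction: crudely strip tags and return lowercase words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_pages):
    """Training: count words per label from (html, label) pairs."""
    word_counts = {}          # label -> Counter of words seen under that label
    page_counts = Counter()   # label -> number of training pages
    for html, label in labeled_pages:
        page_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(extract_words(html))
    return word_counts, page_counts


def classify(html, model):
    """Classification: return the label with the highest log-probability."""
    word_counts, page_counts = model
    vocab = set().union(*word_counts.values())
    total_pages = sum(page_counts.values())
    words = extract_words(html)
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(page_counts[label] / total_pages)
        total_words = sum(counts.values())
        for word in words:
            # Add-one smoothing so an unseen word does not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: two pages you like, two unrelated ones.
training = [
    ("<h1>Rust compiler internals</h1><p>borrow checker and compiler passes</p>", "like"),
    ("<p>Writing a parser and compiler in Rust</p>", "like"),
    ("<h1>Celebrity gossip</h1><p>fashion and celebrity news</p>", "other"),
    ("<p>Weekly fashion deals and gossip</p>", "other"),
]
model = train(training)
print(classify("<p>A new compiler for a small language</p>", model))  # prints "like"
```

With only a handful of pages the result means very little; the point is just to show where the three steps sit in a program.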

(Note: this basic task doesn't involve neural networks. Don't get side-tracked, at least not yet.)

Once you get into it you'll start learning about the statistical models being used, and can start branching off. You can look at classifying into more categories, identifying attributes automatically, correlating documents to each other, and predicting patterns and behavior.