Organize your Rasa 2.0 training data like a boss

Jonathan Wheat (jonathanpwheat) ・6 min read

Rasa 2.x gives us a lot of new features as conversational designers and developers. The coolest one, and one that isn't immediately apparent, is the ability to group and organize your training data any way you please.

At my company we build "abilities" for our bot - think of them as similar to Alexa Skills. Essentially you have some domain information, NLU training data, stories and sometimes actions or forms to round out the ability. The more abilities the bot has, the longer and more complex your files get, which means it takes longer to grok what you're looking at when you drop in to tweak something, let alone onboard a new developer.

The old days

Rasa 1.x allowed you to split up your NLU and stories files simply by creating data/nlu and data/core directories in your project and putting the individual files there. Grouping your data into separate files makes it easier to find something when you need to change it. For example, if you needed to add new chit-chat training data, you could jump into data/nlu/chit-chat.md and add it there. Running the rasa train command uses the files in data/nlu and data/core, in combination with domain.yml in the root of your project, to train your model.

This was great, but not ideal for me. I wanted to split my domain files in a similar way, so I created a data/domain directory and put my files there. Rasa, however, didn't recognize that directory, so I wrote a script to merge these files into a single domain.yml file and drop it in the root. This allowed the rasa train command to use my separate domain-related files.

A New Organizational Paradigm

Rasa 2.x gives us the ability to split up our domain files, and the benefit is clear: smaller files with more focused data. I also don't have to use my custom script now!

Why is this cool? To expand on my explanation above: if your bot can handle chit-chat, weather, restaurant search, and directions, you would have a single long domain.yml file in the root of your project with ALL of your intents, slots, entities, responses, action calls, and form config. Your topical data is interlaced, and that makes it hard to find things. Being able to split this into different files just makes more sense. (Thank you Rasa!)

Your new data directory structure can now look like this:

data/core
data/domain
data/nlu

And each of these can contain multiple files that make up your bot's data. You can even do this with your action files.

Kick It Up A Notch!

Here's an amazing side effect / undocumented feature of the way Rasa deals with training data in 2.x: you can create directories under core, domain, and nlu, and Rasa will recurse down through them looking for files during the training process.

I know you're asking - why is this awesome? In our case, as I said, we build abilities, which are mostly isolated functions and conversational scenarios. In v1 we adopted a filename convention to differentiate between abilities. In v2, by exploiting this new directory structure, we can have individual developers work on a single ability without stepping on the toes of other developers.

They can create a new ability directory - let's say they're working on a book recommendation ability. Our dev creates data/book-recommendation, and in that directory creates domain.yml, nlu.yml, stories.yml, and rules.yml, and works solely from that directory. Fun fact: the filename doesn't matter. Each .yml file is keyed - intents:, nlu:, even rules: - so it doesn't matter how many files you have or what their names are, it all works!
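To make that layout concrete, here is a minimal Python sketch that scaffolds such an ability directory. The helper name scaffold_ability and all of the starter content (the recommend_book intent, the response text, the story and rule names) are made up for illustration, so treat them as placeholders to replace rather than a tested bot. The point is only that each file announces its contents through its top-level key.

```python
from pathlib import Path

# Hypothetical starter content; every intent, response, story, and rule
# name below is a placeholder invented for this example.
STARTER_FILES = {
    "domain.yml": (
        'version: "2.0"\n'
        "intents:\n"
        "  - recommend_book\n"
        "responses:\n"
        "  utter_recommend_book:\n"
        '    - text: "Here is a book you might like."\n'
    ),
    "nlu.yml": (
        'version: "2.0"\n'
        "nlu:\n"
        "  - intent: recommend_book\n"
        "    examples: |\n"
        "      - recommend a book\n"
        "      - what should I read next\n"
    ),
    "stories.yml": (
        'version: "2.0"\n'
        "stories:\n"
        "  - story: recommend a book\n"
        "    steps:\n"
        "      - intent: recommend_book\n"
        "      - action: utter_recommend_book\n"
    ),
    "rules.yml": (
        'version: "2.0"\n'
        "rules:\n"
        "  - rule: always answer a recommendation request\n"
        "    steps:\n"
        "      - intent: recommend_book\n"
        "      - action: utter_recommend_book\n"
    ),
}


def scaffold_ability(project_root: Path, ability: str) -> Path:
    """Create data/<ability>/ with one starter file per top-level key."""
    ability_dir = project_root / "data" / ability
    ability_dir.mkdir(parents=True, exist_ok=True)
    for filename, content in STARTER_FILES.items():
        target = ability_dir / filename
        # Never overwrite a file a developer has already edited.
        if not target.exists():
            target.write_text(content, encoding="utf-8")
    return ability_dir
```

Calling scaffold_ability(Path("."), "book-recommendation") from the project root would create data/book-recommendation with the four files. Since the filenames don't matter, the same content could just as well live in a single file or in ten.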

If you decide to do this, you'll need to run rasa train with the --domain parameter so it will find your domain files:

rasa train --domain data

If you leave off the --domain parameter, Rasa will look for domain.yml in the directory you're running it from, so be sure to delete domain.yml in the root of your project, or you may be quite confused about why your latest changes aren't getting pulled in.

Don't Leave Actions Out In The Cold

You can also do this with your actions.py file, albeit in a different location, and there's an extra file. We create an ability directory under actions/, drop in an empty __init__.py file (making Python treat it as a package), then add an actions.py file (or whatever filename you want).

In our book recommendation example we would have something like this:

actions/__init__.py
actions/book-recommendation 
actions/book-recommendation/__init__.py
actions/book-recommendation/actions.py

Doing all of this directory organization centralizes the code and lets your developer spin up a local rasa init project and work on that ability from beginning to end, creating a very focused bot complete with tests. One caveat is when the ability being worked on is integrated with another ability in some way. Depending on the level of that dependency, you may want to consider whether the new code is actually a separate/new function or an extension of a current ability - but that's getting away from the main topic here.

In practice

We plan to have a special repo of abilities, so when our devs are done they can just move their directories over and open a PR, and that new ability will be available for everyone else on the team to pull down and add to their bot if needed.

But wait, there's more

Up until now, if your ability had any Python action code, you'd have two directories to manage. What if you could create a truly self-contained ability in one directory? Literally add one directory of files, retrain, and have a new ability in your bot?

You can.

To achieve this, we'll move the action files into our data/book-recommendation directory. There's some setup required, however.
Remember the __init__.py files we've been dropping all over? Python uses those to detect if a directory is loadable (a package).

To get our all-in-one setup, we'll need to drop an __init__.py into data and then move the __init__.py and actions.py files from actions/book-recommendation into data/book-recommendation. This way everything is 100% self-contained in one single directory like this:

data/__init__.py
data/book-recommendation
data/book-recommendation/__init__.py
data/book-recommendation/actions.py
data/book-recommendation/stories.yml
data/book-recommendation/domain.yml
data/book-recommendation/nlu.yml

The trick here is to run your action server with the --actions parameter like this:

rasa run actions --actions data

This tells Rasa to load the action files from your data directory, and it will recurse down and load any Python files it finds.
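To see why the __init__.py files matter for that recursion, here is a small sketch of package-style discovery. It is not Rasa's actual loader, just an illustration under the assumption that the action server only descends into directories Python would treat as regular packages. The function name find_action_modules is hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def find_action_modules(package_dir: Path) -> list[str]:
    """Return dotted names of the .py modules reachable from package_dir.

    A subdirectory is only descended into when it contains __init__.py,
    which is how Python recognizes a regular package.
    """
    found: list[str] = []

    def walk(directory: Path, prefix: str) -> None:
        for entry in sorted(directory.iterdir()):
            if entry.is_file():
                # Plain modules count; __init__.py only marks the package.
                if entry.suffix == ".py" and entry.stem != "__init__":
                    found.append(f"{prefix}.{entry.stem}")
            elif entry.is_dir() and (entry / "__init__.py").exists():
                walk(entry, f"{prefix}.{entry.name}")

    walk(package_dir, package_dir.name)
    return found
```

Run against the layout above, find_action_modules(Path("data")) would report data.book-recommendation.actions, while the .yml files are ignored. Under this sketch's assumption, an ability directory missing its __init__.py would be skipped entirely, which is a handy first thing to check when an action fails to register.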

As noted above, you'll also need to run rasa train with the --domain parameter like this:

rasa train --domain data

That will tell Rasa where your domain files are.

Parting Thoughts

I think this is a pretty cool advancement in the ability (no pun intended) to organize our data, streamline our development process and allow a very interesting approach to developing different independent functions.

I'm not sure I like the Python being intermingled in the same directory as my .yml files. It feels a little gross, but I suppose I could also create a data/book-recommendation/actions directory and move all the Python there, other than the __init__.py file of course. Or maybe even go crazy with:

/data/book-recommendation/actions
/data/book-recommendation/data

OR even rename our data directory and create something like this:

/abilities/book-recommendation/actions
/abilities/book-recommendation/data

If you do something crazy like this, be sure to alter your --data, --domain, and --actions parameters when firing up Rasa!

Those both feel a little over the top, but the point is the possibilities are endless and you have the ability to organize your files however makes sense to you.

I'll continue iterating on this approach. I'm interested in knowing what others are doing to organize their data. The single file system works for smaller / simpler bots, but anything with some robustness will quickly outgrow that model (pun intended).

Let me know what you think in the comments!
