Philippe Arteau

Advanced Scripting for Orange Data Mining

Have you ever heard of Orange Data Mining? It is a powerful data mining framework that includes a Python API and a visual GUI.

This blog post is for people looking to add new attributes dynamically to their data set. It also provides a good example of scripting in Orange.

The Problem

I have a CSV dataset with an attribute that can hold multiple values, and I want to create a new attribute for every value present in that multi-value attribute.

Here is a simple example that is easier to understand than my actual dataset.

Initial dataset

country         flag_colors
italy           green,white,red
united kingdom  red,blue,white
russia          white,blue,red
canada          red,white
brazil          green,blue,yellow
germany         black,red,yellow

Target dataset

country         ...  green  white  red  blue  yellow  black
italy           ...  1      1      1    0     0       0
united kingdom  ...  0      1      1    1     0       0
russia          ...  0      1      1    1     0       0
canada          ...  0      1      1    0     0       0
brazil          ...  1      0      0    1     1       0
germany         ...  0      0      1    0     1       1
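
Before moving to Orange, the transformation itself can be sketched in plain Python. This is only an illustration of the idea, not the Orange script: the tab-separated layout of `RAW` and the helper name `expand_multi_values` are assumptions made for this example.

```python
import csv
import io

# The example data from above as tab-separated text (an assumed layout;
# the raw file is not shown in this post).
RAW = (
    "country\tflag_colors\n"
    "italy\tgreen,white,red\n"
    "united kingdom\tred,blue,white\n"
    "russia\twhite,blue,red\n"
    "canada\tred,white\n"
    "brazil\tgreen,blue,yellow\n"
    "germany\tblack,red,yellow\n"
)


def expand_multi_values(rows, column, separator=","):
    """Hypothetical helper: return (new_column_names, expanded_rows)."""
    # Pass 1: collect every distinct value, in order of first appearance.
    seen = []
    for row in rows:
        for value in row[column].split(separator):
            if value not in seen:
                seen.append(value)

    # Pass 2: replace the column with one 0/1 indicator per distinct value.
    expanded = []
    for row in rows:
        present = set(row[column].split(separator))
        new_row = {k: v for k, v in row.items() if k != column}
        for value in seen:
            new_row[value] = 1 if value in present else 0
        expanded.append(new_row)
    return seen, expanded


rows = list(csv.DictReader(io.StringIO(RAW), delimiter="\t"))
colors, expanded = expand_multi_values(rows, "flag_colors")
print(colors)
print(expanded[0])
```

The two passes mirror what any solution needs: the full set of values must be known before the new columns can be created, so the data is read once to collect values and once to fill in the indicators.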

Custom Script Solution

I have annotated my solution.

from Orange.data import Table, Domain, ContinuousVariable
import copy

attributes_to_expand = ['flag_colors'] # Multi-value columns to expand
separator = "," # Matches the example above; can be changed to | ; -

# Structure used to build the new model
attributes_to_keep = []
class_vars_to_keep = []
metas_to_keep = []
variables=[] # Old and new variables
all_values = dict() #values to keep for each multi-values columns

# Building a list of all the known values for each column to expand
for data in in_data:
    for att_exp in attributes_to_expand:
        values = data[att_exp].value.split(separator)
        # A set keeps at most one new attribute per distinct value
        all_values.setdefault(att_exp, set()).update(values)

# Keeping existing metadata and class variables (target)
for orig_meta in in_data.domain.metas:
    metas_to_keep.append(orig_meta)
for orig_class_var in in_data.domain.class_vars:
    class_vars_to_keep.append(orig_class_var)

# Keeping non-multi-values variables
for orig_var in in_data.domain.variables:
    if(orig_var.name in attributes_to_expand or orig_var in class_vars_to_keep):
        continue
    variables.append(copy.copy(orig_var))
    attributes_to_keep.append(orig_var)


# Adding the list of all the new variables
for att_exp in attributes_to_expand:

    for v in all_values[att_exp]:
        variables.append(ContinuousVariable(att_exp+"="+v))

# Output Table construction

## Domain describes the variables of our dataset
domain = Domain(variables,class_vars=class_vars_to_keep,metas=metas_to_keep)

## The Table includes both the domain definition and the data
table = Table.from_domain(domain,len(in_data))

## Rebuilding the data line by line
for index,data in enumerate(in_data):
    # Variables that we keep as-is
    for att in class_vars_to_keep:
        table[index][att] = data[att].value
    for att in attributes_to_keep:
        table[index][att] = data[att].value
    for meta in metas_to_keep:
        table[index][meta] = data[meta].value

    # New variables
    for att_exp in attributes_to_expand:
        # Split once per row instead of once per candidate value
        values_for_current_line = data[att_exp].value.split(separator)
        for v in all_values[att_exp]:
            table[index][att_exp+"="+v] = 1 if v in values_for_current_line else 0

# Making the new dataset available to linked widgets
out_data = table
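
One detail worth noting: the script creates the new variables by iterating over a Python set, so the order of the new columns is not guaranteed to be the same from one session to the next. A possible variation, shown here as a sketch, is to sort the values before building the column names. The helper name `indicator_names` is hypothetical, and stripping whitespace and dropping empty values are extra assumptions that the script above does not make.

```python
def indicator_names(attribute, values):
    """Build 'attribute=value' column names in a stable, sorted order.

    Hypothetical helper: whitespace is stripped and empty values are
    dropped, which the original script does not do.
    """
    cleaned = {v.strip() for v in values if v.strip()}
    return [f"{attribute}={v}" for v in sorted(cleaned)]


names = indicator_names("flag_colors", {"red", "green ", "white", ""})
print(names)
```

In the script, the same effect could be obtained by looping over `sorted(all_values[att_exp])` in both places where the set is iterated.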

Tips for Building Scripts

  • Don't reuse objects such as a Domain's variables from in_data directly in out_data; make a copy first (as the script does with copy.copy). Altering objects shared with in_data can cause side effects.
  • Don't forget to keep class variables and metas. The information could be helpful later in your pipeline.
  • print() debugging is your friend. There is also an interactive console that is very helpful.
  • Some shortcuts are available: [Ctrl]-R: Run script, [Ctrl]-S: Save script, [Ctrl]-/: Comment line ... There is also a Vim mode for the purist. 😉
  • Look at existing transformations before building your own.
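
The first tip is about shared references. A minimal sketch of the effect, using a made-up `FakeVariable` class as a stand-in (it is not an Orange class):

```python
import copy
from dataclasses import dataclass


@dataclass
class FakeVariable:
    """Hypothetical stand-in for a domain variable; not part of Orange."""
    name: str


original = FakeVariable("country")

shared = original                   # Same object under a second name
independent = copy.copy(original)   # Separate object with the same content

shared.name = "renamed"             # Also changes `original`
independent.name = "copy only"      # Leaves `original` untouched

print(original.name)
print(independent.name)
```

After running this, `original.name` is "renamed" because `shared` and `original` are the same object, while the change made through `independent` stays local to the copy.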

Conclusion

Transforming multi-value columns into multiple attributes is very helpful when applying machine learning algorithms to classify and make predictions based on sample data.

Maybe this small script will save you some time...

Top comments (1)

Philippe Arteau

UPDATE: There is now an official plugin that does the same thing!
twitter.com/OrangeDataMiner/status...