Mansa Keïta

Posted on Oct 27, 2021 • Edited on May 25, 2022

Modeling semi-structured data in Rails

#ruby #rails #json #activemodel

Relational databases are very powerful. Their power comes from their ability to...

Preserve data integrity with a predefined schema.
Model complex relationships through joins.

But sometimes we stumble across data that doesn't fit the relational model. We call this kind of data semi-structured data.
When this happens, the things that make relational databases powerful are the things that get in our way, and they complicate our model instead of simplifying it.

That's why document databases exist: to model and store semi-structured data. However, if we choose to use a document database, we'll lose all the power of a relational database.

Luckily for us, relational databases like Postgres and MySQL now have good JSON support. So most of us won't need a document database like MongoDB, as it would be overkill. Most of the time, we only need to denormalize some parts of our model, so it makes more sense to use simple JSON columns for those instead of going all-in and dumping our beloved relational database for MongoDB.

Currently in Rails, we can have full control over how our JSON data is stored in and retrieved from the database by using the Attributes API to serialize and deserialize it. So let's see how we can model semi-structured data in a more convenient way.

Use case: Dealing with bibliographic data

Let's say that we are building an app to help libraries build and manage an online catalog. When we're browsing through a catalog, we often see item information formatted like this:

Author:        Shakespeare, William, 1564-1616.
Title:         Hamlet / William Shakespeare.
Description:   xiii, 295 pages : illustrations ; 23 cm.
Series:        NTC Shakespeare series.
Local Call No: 822.33 S52 S7
ISBN:          0844257443
Series Entry:  NTC Shakespeare series.

But in the library world, data is produced and exchanged in this form:

LDR 00815nam  2200289 a 4500
001 ocm30152659
003 OCoLC
005 19971028235910.0
008 940909t19941994ilua          000 0 eng
010   $a92060871
020   $a0844257443
040   $aDLC$cDLC$dBKL$dUtOrBLW
049   $aBKLA
099   $a822.33$aS52$aS7
100 1 $aShakespeare, William,$d1564-1616.
245 10$aHamlet /$cWilliam Shakespeare.
264  1$aLincolnwood, Ill. :$bNTC Pub. Group,$c[1994]
264  4$c©1994.
300   $axiii, 295 pages :$billustrations ;$c23 cm.
336   $atext$btxt$2rdacontent.
337   $aunmediated$bn$2rdamedia.
338   $avolume$bnc$2rdacarrier.
490 1 $aNTC Shakespeare series.
830  0$aNTC Shakespeare series.
907   $a.b108930609
948   $aLTI 2018-07-09
948   $aMARS

This is what we call a MARC (Machine-Readable Cataloging) record. That's how libraries describe the resources they own.

As you can see, that's really verbose! That's because in the library world, resources are described very precisely, in order to be "machine-readable".
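
To make the layout of these lines concrete, here is a minimal sketch in plain Ruby that splits one line of the listing above into its tag, indicators and subfields. `parse_marc_line` is a hypothetical helper written for illustration, not part of any MARC library. It assumes the exact column layout shown above (tag in the first three characters, indicators in the fifth and sixth), assumes that `$` never appears inside a subfield value, and does not handle the `LDR` line.

```ruby
# Hypothetical helper: parses one line of the textual listing shown above.
# Assumes a 3-character tag, a space, then either a control field value
# or two indicator characters followed by "$"-prefixed subfields.
def parse_marc_line(line)
  tag = line[0, 3]

  # Control fields (001-009) have no indicators or subfields, only a value.
  return { "tag" => tag, "value" => line[4..] } if tag.start_with?("00")

  # Each chunk starts with the one-character subfield code.
  subfields = line[6..].split("$").reject(&:empty?).map do |chunk|
    { "code" => chunk[0], "value" => chunk[1..] }
  end

  {
    "tag"        => tag,
    "indicator1" => line[4],
    "indicator2" => line[5],
    "subfields"  => subfields
  }
end

title   = parse_marc_line('245 10$aHamlet /$cWilliam Shakespeare.')
control = parse_marc_line('003 OCoLC')

title["subfields"].map { |subfield| subfield["code"] } # the codes "a" and "c"
control["value"]                                       # the plain value of the control field
```

The hashes it returns have the same keys as the JSON representation used in the rest of this post.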

For convenience, developers usually represent MARC data in JSON:

{
  "leader": "00815nam 2200289 a 4500",
  "fields": [
    { "tag": "001", "value": "ocm30152659" },
    { "tag": "003", "value": "OCoLC" },
    { "tag": "005", "value": "19971028235910.0" },
    { "tag": "008", "value": "940909t19941994ilua 000 0 eng " },
    { "tag": "010", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "92060871" }] },
    { "tag": "020", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "0844257443" }] },
    { "tag": "040", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "DLC" }, { "code": "c", "value": "DLC" }, { "code": "d", "value": "BKL" }, { "code": "d", "value": "UtOrBLW" } ] },
    { "tag": "049", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "BKLA" }] },
    { "tag": "099", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "822.33" }, { "code": "a", "value": "S52" }, { "code": "a", "value": "S7" } ] },
    { "tag": "100", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "Shakespeare, William," }, { "code": "d", "value": "1564-1616." } ] },
    { "tag": "245", "indicator1": "1", "indicator2": "0", "subfields": [{ "code": "a", "value": "Hamlet" }, { "code": "c", "value": "William Shakespeare." } ] },
    { "tag": "264", "indicator1": " ", "indicator2": "1", "subfields": [{ "code": "a", "value": "Lincolnwood, Ill. :" }, { "code": "b", "value": "NTC Pub. Group," }, { "code": "c", "value": "[1994]" } ] },
    { "tag": "264", "indicator1": " ", "indicator2": "4", "subfields": [{ "code": "c", "value": "©1994." }] },
    { "tag": "300", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "xiii, 295 pages :" }, { "code": "b", "value": "illustrations ;" }, { "code": "c", "value": "23 cm." } ] },
    { "tag": "336", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "text" }, { "code": "b", "value": "txt" }, { "code": "2", "value": "rdacontent." } ] },
    { "tag": "337", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "unmediated" }, { "code": "b", "value": "n" }, { "code": "2", "value": "rdamedia." } ] },
    { "tag": "338", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "volume" }, { "code": "b", "value": "nc" }, { "code": "2", "value": "rdacarrier." } ] },
    { "tag": "490", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
    { "tag": "830", "indicator1": " ", "indicator2": "0", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
    { "tag": "907", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": ".b108930609" }] },
    { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "LTI 2018-07-09" }] },
    { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "MARS" }] }
  ]
}

By looking at this JSON representation, we can see that the data is...

Nested: A MARC record contains many fields, and most of them contain multiple subfields.
Dynamic: Some fields are repeatable ("264" and "948"), and so are subfields. The first fields have neither subfields nor indicators (they're called control fields).
Encapsulated: The meaning of subfields depends on the field they're in (take a look at the "a" subfield for example).
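
These three properties can be observed directly on the parsed JSON. The sketch below uses a shortened copy of the record above and plain Ruby (no Rails). The variable names and the `a_subfield` lambda are only illustrative.

```ruby
require "json"

# A shortened copy of the record above.
record = JSON.parse(<<~JSON)
  {
    "leader": "00815nam 2200289 a 4500",
    "fields": [
      { "tag": "001", "value": "ocm30152659" },
      { "tag": "020", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "0844257443" }] },
      { "tag": "245", "indicator1": "1", "indicator2": "0", "subfields": [{ "code": "a", "value": "Hamlet" }, { "code": "c", "value": "William Shakespeare." }] },
      { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "LTI 2018-07-09" }] },
      { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "MARS" }] }
    ]
  }
JSON

fields = record["fields"]

# Nested: subfields live inside fields, which live inside the record.
subfield_count = fields.sum { |field| field.fetch("subfields", []).size }

# Dynamic: control fields carry a "value" instead of indicators and subfields...
control_tags = fields.reject { |field| field.key?("subfields") }.map { |field| field["tag"] }

# ...and the same tag can appear more than once.
repeated_tags = fields.map { |field| field["tag"] }.tally.select { |_tag, count| count > 1 }.keys

# Encapsulated: the "a" subfield means something different in each field.
a_subfield = lambda do |tag|
  field = fields.find { |candidate| candidate["tag"] == tag }
  field["subfields"].find { |subfield| subfield["code"] == "a" }["value"]
end

a_subfield.call("020") # an ISBN
a_subfield.call("245") # a title
```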

All these characteristics add up to what we call semi-structured data.

Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure. - Wikipedia

A perfect example of that is HTML documents. An HTML document contains different types of tags, which can be nested in multiple ways. It wouldn't make sense to model HTML documents with tables and columns. Imagine having to access nested tags through joins, considering that we could potentially have hundreds of them in a single HTML document. That's why we usually store this kind of data in a text field.

In our case, we're using JSON to represent MARC data. Luckily for us, we can store JSON data directly in relational databases like Postgres or MySQL:

# config/initializers/inflections.rb
ActiveSupport::Inflector.inflections(:en) do |inflect|
  inflect.acronym "MARC"
end

$ rails g model marc/record leader:string fields:json
$ rails db:migrate

We can then create a MARC record like this:

MARC::Record.create leader: "00815nam 2200289 a 4500", fields: [
  { "tag": "001", "value": "ocm30152659" },
  { "tag": "003", "value": "OCoLC" },
  { "tag": "005", "value": "19971028235910.0" },
  { "tag": "008", "value": "940909t19941994ilua 000 0 eng " },
  { "tag": "010", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "92060871" }] },
  { "tag": "020", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "0844257443" }] },
  { "tag": "040", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "DLC" }, { "code": "c", "value": "DLC" }, { "code": "d", "value": "BKL" }, { "code": "d", "value": "UtOrBLW" } ] },
  { "tag": "049", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "BKLA" }] },
  { "tag": "099", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "822.33" }, { "code": "a", "value": "S52" }, { "code": "a", "value": "S7" } ] },
  { "tag": "100", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "Shakespeare, William," }, { "code": "d", "value": "1564-1616." } ] },
  { "tag": "245", "indicator1": "1", "indicator2": "0", "subfields": [{ "code": "a", "value": "Hamlet" }, { "code": "c", "value": "William Shakespeare." } ] },
  { "tag": "264", "indicator1": " ", "indicator2": "1", "subfields": [{ "code": "a", "value": "Lincolnwood, Ill. :" }, { "code": "b", "value": "NTC Pub. Group," }, { "code": "c", "value": "[1994]" } ] },
  { "tag": "264", "indicator1": " ", "indicator2": "4", "subfields": [{ "code": "c", "value": "©1994." }] },
  { "tag": "300", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "xiii, 295 pages :" }, { "code": "b", "value": "illustrations ;" }, { "code": "c", "value": "23 cm." } ] },
  { "tag": "336", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "text" }, { "code": "b", "value": "txt" }, { "code": "2", "value": "rdacontent." } ] },
  { "tag": "337", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "unmediated" }, { "code": "b", "value": "n" }, { "code": "2", "value": "rdamedia." } ] },
  { "tag": "338", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "volume" }, { "code": "b", "value": "nc" }, { "code": "2", "value": "rdacarrier." } ] },
  { "tag": "490", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
  { "tag": "830", "indicator1": " ", "indicator2": "0", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
  { "tag": "907", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": ".b108930609" }] },
  { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "LTI 2018-07-09" }] },
  { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "MARS" }] }
]

And access it this way:

record = MARC::Record.first
field = record.fields.find { |field| field["tag"] == "245" }
subfield = field["subfields"].first
subfield["value"]
=> "Hamlet"

It works, but...

It's not very convenient to access nested data this way.
We cannot easily attach logic to our JSON data without polluting our model.

What if we could interact with our JSON data the same way we do with ActiveRecord associations? Enter ActiveModel and the Attributes API!

First, we have to define a custom type which...

Maps JSON objects to ActiveModel-compliant objects.
Handles collections.

To do that, we'll add the following options to our type:

:class_name: The class name of an ActiveModel-compliant object.
:collection: Specifies whether the attribute is a collection. Defaults to false.

class DocumentType < ActiveModel::Type::Value
  attr_reader :document_class, :collection

  def initialize(class_name:, collection: false)
    @document_class = class_name.constantize
    @collection     = collection
  end

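  def cast(value)
    # Nothing to cast for a missing attribute (e.g. the subfields of a control field)
    return if value.nil?

    if collection
      value.map { |attributes| process attributes }
    else
      process value
    end
  end

  def process(value)
    # Leave already-built documents untouched, build the others from a hash
    value.is_a?(document_class) ? value : document_class.new(value)
  end

  def serialize(value)
    value.to_json
  end

  def deserialize(json)
    # A NULL column has nothing to decode
    return if json.nil?

    value = ActiveSupport::JSON.decode(json)

    cast value
  end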

  # Track changes
  def changed_in_place?(old_value, new_value)
    deserialize(old_value) != new_value
  end
end
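
To see what this type does without booting Rails, the round trip can be imitated in plain Ruby. This is only a sketch under stated assumptions: `SubfieldValue` is a stand-in `Struct` rather than an ActiveModel object, and `cast`, `serialize`, `deserialize` and `changed_in_place` are free-standing lambdas that mirror the methods above. It also hints at why the embedded objects need a value-based `==` for change tracking to work.

```ruby
require "json"

# Stand-in for an ActiveModel-compliant class (hypothetical name).
# Struct provides a value-based ==, which change tracking relies on.
SubfieldValue = Struct.new(:code, :value, keyword_init: true) do
  def to_json(*args)
    to_h.to_json(*args)
  end
end

# Mirrors DocumentType#cast for a collection: hashes in, objects out.
cast = lambda do |value|
  value.map { |attributes| SubfieldValue.new(**attributes.transform_keys(&:to_sym)) }
end

# Mirror DocumentType#serialize and DocumentType#deserialize.
serialize   = ->(value) { value.to_json }
deserialize = ->(json) { cast.call(JSON.parse(json)) }

# Mirrors DocumentType#changed_in_place?: rebuild the objects from the
# stored JSON and compare them with the objects currently in memory.
changed_in_place = ->(old_json, new_value) { deserialize.call(old_json) != new_value }

stored    = '[{"code":"a","value":"Hamlet"}]'
subfields = deserialize.call(stored)

unchanged = changed_in_place.call(stored, subfields) # nothing was mutated yet
subfields.first.value = "Romeo and Juliet"
changed   = changed_in_place.call(stored, subfields) # the mutation is detected
```

Without a value-based `==`, two freshly deserialized objects would never compare as equal, so every record would look changed.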

Let's register our type, since we're going to use it multiple times:

# config/initializers/type.rb
ActiveModel::Type.register(:document, DocumentType)
ActiveRecord::Type.register(:document, DocumentType)

Now we can use it in our models:

class MARC::Record < ApplicationRecord
  attribute :fields, :document,
    class_name: "MARC::Record::Field",
    collection: true

  def at(tag)
    fields.find { |field| field.tag == tag }
  end
end

class MARC::Record::Field
  include ActiveModel::Model
  include ActiveModel::Attributes
  include ActiveModel::Serializers::JSON

  attribute :tag, :string
  attribute :value, :string
  attribute :indicator1, :string
  attribute :indicator2, :string
  attribute :subfields, :document,
    class_name: "MARC::Record::Field::Subfield",
    collection: true

  # Control fields don't have indicators or subfields
  def attributes
    if control_field?
      {
        "tag" => tag,
        "value" => value
      }
    else
      {
        "tag" => tag,
        "indicator1" => indicator1,
        "indicator2" => indicator2,
        "subfields" => subfields
      }
    end
  end

  def control_field?
    /00\d/ === tag
  end

  def at(code)
    subfields.find { |subfield| subfield.code == code }
  end

  alias [] at

  # Used to track changes
  def ==(other)
    attributes == other.attributes
  end
end

class MARC::Record::Field::Subfield
  include ActiveModel::Model
  include ActiveModel::Attributes
  include ActiveModel::Serializers::JSON

  attribute :code, :string
  attribute :value, :string

  def ==(other)
    attributes == other.attributes
  end
end

Let's test this in the console:

record.at("245")["a"].value
=> "Hamlet"

record.changed?
=> false

record.at("245")["a"].value = "Romeo and Juliet"
record.at("245")["a"].value
=> "Romeo and Juliet"

record.changed?
=> true

Et voilà! Home-made associations!

Luckily, you won't need to implement this yourself, as this gem does it for you (and even more).

Here's how we can simplify our models:

class MARC::Record < ApplicationRecord
  include ActiveModel::Embedding::Associations

  embeds_many :fields

  # ...
end

class MARC::Record::Field
  include ActiveModel::Embedding::Document

  # ...

  embeds_many :subfields

  # ...
end

class MARC::Record::Field::Subfield
  include ActiveModel::Embedding::Document

  # ...
end

We can then code our views with nested attributes support out-of-the-box:

# app/views/marc/records/_form.html.erb
<%= form_with model: @record do |record_form| %>
  <% @record.fields.each do |field| %>
    <%= record_form.fields_for :fields, field do |field_fields| %>

      <%= field_fields.label :tag %>
      <%= field_fields.text_field :tag %>

      <% if field.control_field? %>
        <%= field_fields.text_field :value %>
      <% else %>
        <%= field_fields.text_field :indicator1 %>
        <%= field_fields.text_field :indicator2 %>

        <%= field_fields.fields_for :subfields do |subfield_fields| %>
          <%= subfield_fields.label :code %>
          <%= subfield_fields.text_field :code %>
          <%= subfield_fields.text_field :value %>
        <% end %>
      <% end %>
    <% end %>
  <% end %>

  <%= record_form.submit %>
<% end %>

We can even use validations:

class MARC::Record < ApplicationRecord
  # ...

  validates :fields, presence: true
  validates_associated :fields
end

class MARC::Record::Field
  # ...

  validates :subfields, presence: true, unless: :control_field?
  validates_associated :subfields, unless: :control_field?
end

class MARC::Record::Field::Subfield
  # ...

  validates_presence_of :code, :value
end

record = MARC::Record.new
record.valid?
=> false

record.fields = [{ tag: "245" }]
record.valid?
=> false

record.at("245").subfields = [{ code: "a", value: "Ruby on Rails" }]
record.valid?
=> true
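
Conceptually, validating an embedded collection means the parent is valid only if every child is. The plain-Ruby sketch below imitates that rule for the three console calls above. `SketchField` and `SketchSubfield` are hypothetical stand-ins, and their `valid?` methods are hand-written rather than generated by ActiveModel validations, so this shows the idea and not how the gem implements it.

```ruby
SketchSubfield = Struct.new(:code, :value, keyword_init: true) do
  # Plays the role of `validates_presence_of :code, :value`.
  def valid?
    [code, value].none? { |attribute| attribute.nil? || attribute.strip.empty? }
  end
end

SketchField = Struct.new(:tag, :value, :subfields, keyword_init: true) do
  def control_field?
    /\A00\d\z/.match?(tag.to_s)
  end

  # Data fields need at least one subfield, and every subfield must be valid.
  def valid?
    return true if control_field?

    !subfields.to_a.empty? && subfields.all?(&:valid?)
  end
end

# Plays the role of `validates :fields, presence: true` combined with
# `validates_associated :fields` on the record.
record_valid = ->(fields) { !fields.empty? && fields.all?(&:valid?) }

title = SketchField.new(tag: "245")

without_fields    = record_valid.call([])      # no fields at all
without_subfields = record_valid.call([title]) # a data field with no subfields

title.subfields = [SketchSubfield.new(code: "a", value: "Ruby on Rails")]
complete = record_valid.call([title])          # every level is now valid
```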

We can use custom collections if we need to add custom behaviour:

class MARC::Record::FieldCollection
  include ActiveModel::Embedding::Collecting
  include Enumerable

  def at(tag)
    find { |field| field.tag == tag }
  end

  def repeated?(field)
    # ...
  end

  # ...
end

class MARC::Record < ApplicationRecord
  include ActiveModel::Embedding::Associations

  embeds_many :fields, collection: "FieldCollection"

  delegate :at, :repeated?, to: :fields

  # ...
end

record = MARC::Record.first
record.at("245")["a"].value
=> "Hamlet"

record.repeated?("245")
=> false

record.repeated?("264")
=> true
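
The body of `repeated?` is left out above. As one possible reading of the console output, here is a plain-Ruby sketch of a collection that answers both `at` and `repeated?`. `FieldList` and `TaggedField` are hypothetical stand-ins (a real collection would include `ActiveModel::Embedding::Collecting` as shown above), and `repeated?` takes a tag here because that is what the console calls pass in.

```ruby
# Hypothetical stand-in for a field: only the tag matters for this sketch.
TaggedField = Struct.new(:tag, keyword_init: true)

class FieldList
  include Enumerable

  def initialize(fields)
    @fields = fields
  end

  # Enumerable only needs #each to provide find, count, select, and so on.
  def each(&block)
    @fields.each(&block)
  end

  def at(tag)
    find { |field| field.tag == tag }
  end

  # One possible definition: a tag is repeated if more than one field uses it.
  def repeated?(tag)
    count { |field| field.tag == tag } > 1
  end
end

tags   = %w[245 264 264 948 948]
fields = FieldList.new(tags.map { |tag| TaggedField.new(tag: tag) })

fields.repeated?("245") # false: only one 245 field
fields.repeated?("264") # true: two 264 fields
```

Because the class includes `Enumerable`, delegating `at` and `repeated?` from the record, as done above, is enough to expose them on the parent.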

We can use custom types if we need to cast the elements of a collection:

class MARC::Record::FieldType < ActiveModel::Type::Value
  def cast(value)
    # ...
  end
end

class MARC::Record < ApplicationRecord
  include ActiveModel::Embedding::Associations

  embeds_many :fields, cast_type: "FieldType"

  # ...
end

So the next time you need to model semi-structured data in your Rails application...

Give this gem a try!
Or use the Attributes API.
