SDC, Mongo, and Scaling

In this post:

Designing Relational and NoSQL Schemas
Lessons from the Blog Project
What's exciting about this data

Designing Relational and NoSQL Schemas

We're on day three of SDC, the backend project in which we'll learn and apply concepts like:

Database selection and schema design
Horizontal and vertical scaling
Deployment and server administration

I'm excited to learn these things.

Our current challenge is to design schemas for a relational and a non-relational database. Our group is going with two common choices: PostgreSQL and MongoDB, respectively.

The relational schema design was fairly straightforward. Focusing on normalization and primary-foreign key relationships, I was able to conceptualize the schema without any real struggle.
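To make the primary-foreign key idea concrete, here is a minimal sketch in Python. It uses the standard library's in-memory SQLite as a stand-in for PostgreSQL so it runs anywhere, and the table and column names (questions, answers, product_id) are my own assumptions for illustration, not our group's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Hypothetical normalized tables: each answer points at exactly one question.
conn.executescript("""
CREATE TABLE questions (
    id         INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL,
    body       TEXT    NOT NULL
);
CREATE TABLE answers (
    id          INTEGER PRIMARY KEY,
    question_id INTEGER NOT NULL REFERENCES questions(id),
    body        TEXT    NOT NULL
);
""")

conn.execute(
    "INSERT INTO questions (id, product_id, body) VALUES (1, 42, 'Does it run small?')"
)
conn.executemany(
    "INSERT INTO answers (question_id, body) VALUES (?, ?)",
    [(1, "Yes, size up."), (1, "Fit was fine for me.")],
)

# One join rebuilds the 1-to-N relationship at read time.
rows = conn.execute(
    """
    SELECT q.body, a.body
    FROM questions AS q
    JOIN answers AS a ON a.question_id = q.id
    WHERE q.product_id = ?
    ORDER BY a.id
    """,
    (42,),
).fetchall()
```

The foreign key means the database itself refuses an answer that points at a question that doesn't exist, which is part of why this side of the design felt straightforward.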

Designing for MongoDB is more of a challenge, and more interesting, because of all the choices the designer must consider. Choices like:

Are parent-child relationships modeled by embedding or by reference? (What's the cardinality of each 1-to-N relationship?)
Is the read-to-write ratio high?
How do you intend to access data?

Lessons from the Blog Project

Already my blog project has paid dividends because I've gotten some hands-on experience with the tradeoffs that these choices imply.

I decided to go with what I felt was the simplest schema for my early purposes - a blog post model that embedded author info within the post. Great for creating a feed of blog posts (the author is right there in the post data), but not great for accessing author information (creating a list of authors). The project is a great playground for trying new things - even weak ideas, to learn what makes them weak.
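Here's a rough sketch of that tradeoff in Python, with plain dicts standing in for MongoDB documents. The field names and sample data are made up for illustration and aren't my blog's real schema.

```python
# Hypothetical blog documents with the author embedded in each post.
posts = [
    {"title": "Day one", "author": {"name": "Ada", "email": "ada@example.com"}},
    {"title": "Day two", "author": {"name": "Ada", "email": "ada@example.com"}},
    {"title": "Guest post", "author": {"name": "Lin", "email": "lin@example.com"}},
]


def feed(posts):
    """Build a feed: everything needed is already inside each post."""
    return [f"{p['title']} by {p['author']['name']}" for p in posts]


def list_authors(posts):
    """List authors: scan every post and de-duplicate the embedded copies."""
    seen = {}
    for p in posts:
        # Keyed by email so an author with several posts is counted once.
        seen.setdefault(p["author"]["email"], p["author"]["name"])
    return sorted(seen.values())
```

The feed is a single pass with no lookups, while the author list has to walk every post just to find a handful of distinct people. That's the weakness I ran into.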

What's exciting about this data

The size of the dataset. This is my first time working with data that runs to hundreds of thousands of rows. We're talking CSV files that are hundreds of MBs in size!

This is the kind of data I want to be working with, as a student but also professionally, so I'm excited to be at this stage. This is when things like big-O notation come back around: when designing projects with scale in mind. And I suspect that bad decisions here can really punish us with poor performance and, in turn, a poor user experience.

Some Modeling Considerations:

Answers have to be accessed on their own.
1-to-N Question-Answer (an answer never exists without its question)
Relative to their parents, there are 'many' (versus 'some' or 'squillions') questions and answers.

Questions:

Post/insert speed: important or not important?

A rule of thumb I came across: "If there are more than a couple of hundred documents on the 'many' side, don't embed them; if there are more than a few thousand documents on the 'many' side, don't use an array of ObjectID references."
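That rule of thumb can be sketched as a tiny decision function in Python. The exact cutoffs are my assumption: the rule only says "a couple of hundred" and "a few thousand", so 200 and 2000 are illustrative numbers, and the third branch (storing the parent's id on each child) is the usual alternative rather than something the rule spells out.

```python
# Assumed cutoffs: the rule says "a couple of hundred" and "a few thousand",
# so these exact numbers are illustrative, not official limits.
EMBED_LIMIT = 200
ID_ARRAY_LIMIT = 2000


def relationship_strategy(children_per_parent: int) -> str:
    """Suggest how to model a 1-to-N relationship from its expected cardinality."""
    if children_per_parent <= EMBED_LIMIT:
        return "embed children in the parent"
    if children_per_parent <= ID_ARRAY_LIMIT:
        return "array of ObjectID references in the parent"
    return "parent reference stored on each child"
```

For example, relationship_strategy(50) suggests embedding, while relationship_strategy(100_000) suggests keeping a parent reference on each child instead of growing an array in the parent.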

https://stackoverflow.com/questions/31537930/how-do-i-reference-objectid-in-another-collection-in-mongodb-node

I'd like to assume that there are at most a few thousand questions per product, and thus use an array of ObjectID references to questions.
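Roughly, the array-of-references idea could look like the Python sketch below. Plain dicts keyed by id stand in for two MongoDB collections, string ids stand in for ObjectIDs, and the names (products, questions, question_ids) are placeholders I picked for illustration.

```python
# Plain dicts keyed by id stand in for two MongoDB collections.
questions = {
    "q1": {"_id": "q1", "body": "Does it run small?"},
    "q2": {"_id": "q2", "body": "Is it machine washable?"},
    "q3": {"_id": "q3", "body": "A question about some other product"},
}
products = {
    # The product holds only the ids of its questions, not the questions themselves.
    "p1": {"_id": "p1", "name": "Jacket", "question_ids": ["q1", "q2"]},
}


def questions_for(product_id):
    """Two-step lookup: read the product, then fetch each referenced question."""
    product = products[product_id]
    return [questions[qid] for qid in product["question_ids"]]
```

In a real MongoDB setup the second step would be a query against the questions collection (for instance with the $in operator on _id) rather than a dict lookup, but the shape of the data is the same: questions stay in their own collection and can still be read without going through the product.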

This is just the start and I'm excited to go deeper!