Kellen for QuestDB

Posted on • Originally published at questdb.io

Our Website Source Is Now Private, A Cautionary Tale

QuestDB is a high-performance time-series database built in Java and C++, with no dependencies and zero garbage collection. Check us out if you have time series data and are looking for high throughput ingestion and fast SQL queries. Not to worry: QuestDB remains open source!

"Imitation is the most sincere form of flattery" - Oscar Wilde

As an open source company that strives to be as transparent as possible — both internally and in the community — we want to have all our code public. We want you to see it all, even for our blogs, docs and marketing pages. But as far as our website goes, recent unfortunate events have made us reconsider our position.

We'll share what we learned so that it doesn't happen to you!

Great artists…

For page traffic, we want to know who's reading what and what gets shared. If we know what you like and what is helpful, we can make more of it.

We're growing, and our traffic remains modest. When something succeeds on Reddit or HackerNews, we feel it. When the charts go up, it's a real thrill.

During one such event, one of our engineers, Maciej, looked into Google Analytics and noticed something of an... anomaly in our traffic. It had exploded.

And a large portion of it originated from an uncommon region — Brazil:

158 thousand views from Brazil. Usual highest, US, below with 1400.

Hey, neat! Something must have resonated within the Brazilian developer community. But which page? One of our deep technical articles? Must be.

A few clicks later, we determined that the source of traffic was not what we had hoped. Far from coming from one of our fresh new articles, it was generated by an unknown page. A page with which none of us had any familiarity. This was because it was, in fact, from a different website entirely:

The offending site, but crossed out.

Uh oh. But was this just one page? Nope, it was many of them.

Looking deeper, traffic spanned many paths:

A list of strange paths that imply... maybe games? Table games?!

Apart from the root domain, these were all unexpected. Not good. But with all this data pointing to unfamiliar pages, we knew exactly where to look for answers. The moment we landed on the strange site, it confirmed what we had all expected.

While the main landing page and its site paths were altered, the rest of our site had been copied over and hosted in full: metadata, supporting pages with trademarks, images, copy, logos and all. Little effort was taken to obscure this fact.

According to Google Analytics, in a short time we had collected well over 150,000 new visitors to “our site”. But in reality, those visitors had landed on an entirely different one.

Well, hey. Traffic is traffic, right? All press is good press? And the theme was open source, isn't that what open source is for? Yes and no, it's not so clear.

There are things to consider.

Not flattering

Their post-launch results were staggering. Once live, their pages spread like wildfire. And we know, we have the data to prove it. But the intermingling traffic made us nervous. This is all much different from our usual keywords, demographics and traffic volume. On top of that, it came from an industry that is somewhat of a gray area.

Could we receive punishment as a result of the clientele of this site? Did we just become flagged as a “toxic website” to the many sites that interlink with our content? Are our search rankings about to plummet?

Any of that would be very bad. Can we resolve this, and fast? Yes, we can. But before we go any further, how did this even happen?

We can point to two key reasons.

Docusaurus showcase

We use a static site generator called Docusaurus for our docs, blogs and marketing pages. The design is customized, and tailored for your (and thus our) needs. As such, it felt great for the team when Docusaurus featured the QuestDB.io design within their showcase:

QuestDB.io shown in the Docusaurus showcase.

People looking for inspiration for their new Docusaurus site can use the filters to select “open source”. From there, as expected, a click of the source button leads you to, that's right — all of it, the goods. This very blog and all its surrounding pieces are hosted from within that repository, in its entirety…

Given the complexity and uniqueness of our Docusaurus rendition, it did not seem likely someone would fork it. There has been minimal work to template it as such. One does not simply swap some config options and colours and arrive at a brand new site. We're also on a fairly dated version of Docusaurus, and we'd expect people to want the new stuff. But it turns out we were wrong.

And upon reflection, we get it. The source is open and under a permissive license: Apache 2.0. People are free to do as they will, within the parameters of the license. But one consideration of the license is to respect existing trademarks and product names:

... This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor...

Using the styles, code and layouts is one thing. But using the same logos, trademarks, and so on, is another. And as for their traffic appearing in our Google Analytics, the blame for that lies with us.

Remember your env vars

Analytics providers, search engines, and other third parties that issue browser-exposed tokens usually make them safe, read-only keys. Because they are exposed to the client, they are most often non-destructive.

As many of you know, each property in Google Analytics gets its own GA tag which is then placed in Google's JavaScript snippet. These are visible when you inspect a website's source.

To prove it, visit a tech site that you like, inspect the source, and search for GTM. You will most likely find a GTM-XXXXX value, which is the site's Google Tag Manager container ID. There is nothing stopping you from using it, except that it won't reveal anything unless you have access to the matching Google Analytics dashboard.
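As a rough illustration of how visible these identifiers are, here is a minimal Python sketch that pulls tag-style IDs out of a page's HTML. The regular expression is a heuristic covering the common GTM-, UA- and G- formats, and the sample HTML and its ID are made-up placeholders, not anything taken from a real site.

```python
import re
from typing import List

# Heuristic patterns for common Google tag identifiers:
#   GTM-XXXXXXX   Tag Manager container ID
#   UA-12345-1    Universal Analytics property ID
#   G-XXXXXXXXXX  GA4 measurement ID
TAG_ID_PATTERN = re.compile(r"\b(GTM-[A-Z0-9]+|UA-\d+-\d+|G-[A-Z0-9]{8,})\b")


def find_tag_ids(html: str) -> List[str]:
    """Return the distinct tag IDs found in html, in order of first appearance."""
    seen: List[str] = []
    for match in TAG_ID_PATTERN.finditer(html):
        tag_id = match.group(1)
        if tag_id not in seen:
            seen.append(tag_id)
    return seen


# Hypothetical page source; the ID below is a placeholder, not a real container.
sample_html = """
<script>(function(w,d,s,l,i){/* ... */})(window,document,'script','dataLayer','GTM-ABC1234');</script>
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-ABC1234"></iframe></noscript>
"""

print(find_tag_ids(sample_html))
```

Running it on the sample prints a single ID, even though the ID appears twice in the markup. Anyone who can view a page's source can do the same.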

Though it is somewhat uncomfortable to admit, we soon confirmed that our Google Analytics tag was hard-coded directly in the source:

Shows us applying an ENV VAR over the hard coded GTM tag.

This meant that the repository could be cloned or forked, and bam — the new website is a part of QuestDB, as far as Google Analytics is concerned. This is both unfortunate, and preventable. In hindsight, the value could have been hidden behind an environment variable in the actual code. It is now!

The moral of the story: set env vars for any sensitive - or somewhat sensitive - variable. It is basic advice, yes, but sometimes one forgets just how wild and random the broad internet can be. Even if you think no one would use it or that no damage could be done, set it as an env var anyway.
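To make the advice concrete, here is a minimal sketch of the pattern in Python. It is not our actual site code: the variable name `ANALYTICS_TAG_ID` is hypothetical, and the rendered `<script>` element is a stand-in for a provider's real snippet. The idea is only that the tag lives in the build environment, and that a build without it emits no tracking.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name, chosen for this example only.
TAG_ENV_VAR = "ANALYTICS_TAG_ID"


def analytics_tag(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the analytics tag from the environment, or None if unset or blank."""
    value = env.get(TAG_ENV_VAR, "").strip()
    return value or None


def analytics_snippet(env: Mapping[str, str] = os.environ) -> str:
    """Render a placeholder tracking snippet, or an empty string with no tag set."""
    tag = analytics_tag(env)
    if tag is None:
        # A fork or clone built without the variable emits no tracking at all.
        return ""
    # Placeholder markup; a real provider's snippet would go here.
    return f'<script data-tag-id="{tag}"></script>'
```

With this shape, someone who clones the repository gets a site that builds fine but reports nothing to the original analytics property, because the tag never appears in the source.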

Closed source, for now

Despite reaching out to Google, we were unable to get anyone to provide any help. No matter: we cleaned up and applied a new Google Analytics tag. And it appears that the blip is not interfering with our business in any major way.

As of yet we have seen no punitive impact to our rankings. But as a precaution, we have decided to make a change to the visibility of our website repository. The website source will now be set to "private".

A trolley problem meme. Do we keep the source open, and assume risk?

This isn't so we can twirl our moustaches like villains and apply shady marketing practices. It's to protect us so that if we do silly things like forget an env var, we won't risk cratering the hard-earned value of our website property.

That said, we will work to open parts of it in the future. For example, documentation has received helpful contributions from community members. Closing the door on them doesn't feel right. Luckily, there is a way around that.

Right now, QuestDB.io is a single build of our Docusaurus repository. Any doc, blog or content change in the repo generates a new Netlify build. In the future, we can extract doc contents, which exist in their own folder as .md or .mdx files, and host them in their own open repository.

Using GitHub Actions or another build runner, we can set up a pipeline like this:

  1. On PR event, create a temporary workspace
  2. Pull private repo & docs into workspace
  3. Build workspace as though whole
  4. Provide PR preview
  5. On merge, rsync doc contents to private repo
  6. On commits to main, rebuild production website

With this model, our documentation can remain open even with the rest closed.
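The file-shuffling parts of that pipeline (steps 1, 2 and 5) could be sketched roughly as follows. This is only an illustration under assumptions: the `docs` folder name and both function names are hypothetical, the sketch copies directories with Python's `shutil` rather than calling rsync, and the build and PR preview steps are left out because they depend on the build runner.

```python
import shutil
import tempfile
from pathlib import Path


def assemble_workspace(private_site: Path, open_docs: Path, docs_subdir: str = "docs") -> Path:
    """Steps 1-2: copy the private site into a fresh temp dir, then overlay the open docs."""
    workspace = Path(tempfile.mkdtemp(prefix="site-build-"))
    shutil.copytree(private_site, workspace, dirs_exist_ok=True)
    shutil.copytree(open_docs, workspace / docs_subdir, dirs_exist_ok=True)
    return workspace


def sync_docs_back(open_docs: Path, private_site: Path, docs_subdir: str = "docs") -> None:
    """Step 5: after a merge, mirror the open docs into the private repo's docs folder."""
    target = private_site / docs_subdir
    if target.exists():
        # Remove first so files deleted in the open repo do not linger.
        shutil.rmtree(target)
    shutil.copytree(open_docs, target)
```

The workspace returned by `assemble_workspace` is what step 3 would build "as though whole". Because the private sources only ever exist in that temporary directory on the runner, the open docs repository never needs to contain them.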

Summary

QuestDB is an open source company. We want to work in the open: no secrets! But we've decided to make our website source private for the time being.

Remember: even if the tag or key or whatever seems benign, use an environment variable. It might save you from a real hassle.

Top comments (1)

Vincent A. Cicirello

The clone of your site will probably be flagged by Google as duplicate content and excluded from search results, since your original content was probably already indexed. If you had canonical URLs in the source they copied, with full absolute paths, and if the cloner didn't change them in their copy, then Google will get that as an additional cue that yours is the authoritative version. So you probably won't be penalized by search engines.