Josh Laurito is the head of data engineering at Fusion Media Group, the publisher of the web’s most widely read media brands, reaching over 90MM unique visitors a month. His team occasionally blogs at fmgdata.kinja.com. He also runs a newsletter for the NYC data community that you can subscribe to here.
Most of our work is dedicated to our publishing platform, Kinja. Kinja used to only allow homepages organized like blogs, with stories listed newest-to-oldest. About a year ago, we started supporting stories being 'pinned' to the top of the homepage. This was a great change that was applauded throughout the company, giving our editors a chance to highlight each publication's best work.
The flip side of manual curation, though, was that we as an organization hadn’t ever really picked out stories for special coverage this way before. While it was generally pretty clear to the editors what stories should be pinned, we had no experience with how long they should stay pinned. Is a big story from 8 hours ago more engaging to our audience than a smaller story that’s just breaking now?
At first my team did some analysis, thinking we could automatically promote stories to the top of the page, or at least recommend what stories should go there.
However, it was pretty clear that the editors needed to have a lot of control here. Editorial organizations have their own sensibilities that are really difficult to articulate in a model. Additionally, there was a lot of planned event coverage, like the Apple WWDC for Gizmodo or E3 for Kotaku, where we’d need to have posts queued up to override anything automated.
In the end, we built a lightweight alerting system. The math is pretty straightforward: we calculate expected click rates for stories on each site that we support, then see how stories are performing against that. We integrated the alerts into each publication’s Slack room, and added a glyph system so editors knew exactly what the numbers meant. Here’s an example from Lifehacker:
So instead of building something that was completely automated around pinning stories, we used the data to educate and support our editorial team. It’s been successful in that it drives changes to the homepage, but doesn’t dictate terms: if editors want to keep up a post despite lower objective performance, maybe because the reporting is excellent, or they just think a piece is really fun, they have that option.
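The alerting logic described above can be sketched in a few lines. This is a minimal illustration, not our production code: the thresholds, glyphs, and data shapes here are hypothetical, and the real system computes expected click rates per site and per homepage slot.

```python
# Hypothetical sketch of click-rate alerting for pinned stories.
# Each story's observed click-through rate is compared against an
# expected rate, and the ratio is mapped to a glyph editors can
# read at a glance in a Slack message.

def performance_glyph(clicks: int, impressions: int, expected_ctr: float) -> str:
    """Return a glyph summarizing performance vs. the expected click rate."""
    if impressions == 0:
        return "·"  # not enough data yet
    ratio = (clicks / impressions) / expected_ctr
    if ratio >= 1.25:
        return "▲"  # well above expectations: keep it pinned
    if ratio >= 0.75:
        return "—"  # roughly in line with expectations
    return "▼"      # underperforming: consider rotating the pin

def build_alert(stories: list[tuple[str, int, int]], expected_ctr: float) -> str:
    """Format one alert line per pinned story, Slack-message style."""
    lines = []
    for title, clicks, impressions in stories:
        glyph = performance_glyph(clicks, impressions, expected_ctr)
        ctr = clicks / impressions if impressions else 0.0
        lines.append(f"{glyph} {title}: {ctr:.1%} (expected {expected_ctr:.1%})")
    return "\n".join(lines)

stories = [
    ("Big breaking story", 900, 10_000),
    ("Older pinned feature", 450, 10_000),
]
print(build_alert(stories, expected_ctr=0.06))
```

The point of the glyphs is that editors never have to interpret raw ratios; the message itself says whether a pin is earning its spot.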
Beyond curation, does the data team also make suggestions about the actual headlines or content of articles?
We've talked about this a lot, and we've shied away from it so far. Occasionally for sponsored content or the sales team we'll help research a strategy, but not for editorial so far.
There are two main data-oriented reasons we've stayed away (our editorial staff could probably give you a few more). One is that we're worried that we'll dilute what makes our publications special, which is their voice. When lots of people are testing headlines or content, there's going to be convergent evolution, which means sites will be harder to differentiate. While I have no doubt that leads to short-term improvements in metrics, I am certain it will dilute what makes us a destination for our readers.
The second reason is that the implicit assumption behind testing is that the audience you're training on (today's traffic) looks a lot like your test set (future traffic). But the fact is that most of our traffic is filtered through someone else's algorithm first, be it Facebook, Google, Twitter, or someone else. I don't have a lot of confidence in the stability of those algorithms, and I think over-optimizing for them could make us susceptible to changes. I'd rather let the editorial and audience teams try what they want and have a more diverse set of headlines and content.
I actually graduated from college with a degree in chemistry, and moved into finance, figuring that even if I didn't know what I wanted to do, at least I could make some money and meet some smart people. It was right during the housing boom, and I had a friend who worked in a mortgage derivatives group at a big bank who helped me get a job as a structurer, which is basically someone who does scenario modeling.
The work was interesting and challenging, but the workload was overwhelming if you couldn't manage it. I learned how to program in order to keep up with all the requests for my time. Most of these lessons happened thanks to the guy who sat next to me, who had a master's in computer science and took pity on me manually entering scenario parameters over and over again at 1AM.
The bottom fell out of the market a few years later and all of us got laid off. It was a tough lesson in the limits of what our models could do. I actually worked with a few people who ended up as characters in The Big Short.
I ended up at an insurance company, and one day the CEO of the company walked by my desk and complained to me that he had no way to match his data with government data. So if he wanted to see how many of his insurance policies were written in areas that have, for example, high unemployment rates, he was stuck.
It sounded like a really interesting problem, so between Christmas and New Year's when nobody else was around I prototyped an app that took information from internal and government sources, and put together choropleth maps.
When people came back in January, I showed off my app and everyone was pretty excited about it. We built a small team around the idea. What I didn't know at the time was that the insurance company had invested too much in mortgages, just like the bank, and would effectively be going bankrupt soon.
Fortunately, two of the executives I worked with were interested in spinning off a startup based on my app. That became my first startup, Lumesis. The two founders wanted to move the company up to Connecticut, which didn't really appeal to me, but I was excited about the tech world after getting a taste, and started looking for other jobs in tech that used math and statistics to solve problems.
I bounced around a few startups before a friend of mine whom I had worked with at CrowdTwist, a Techstars company in Flatiron, recruited me over to Gawker. I started working there in 2014, and have stayed here through the bankruptcy and acquisition by Univision.
Oh wow, it's like night and day. When I started, I was effectively the only data scientist/analyst (we had a data engineer though). I spent almost all my time coding, testing, and writing up results.
Now that the team is bigger and we work for a large corporation, I spend a lot of time managing, doing strategy work, and hiring. Speaking of which, we're hiring for product, engineering, and data roles, as well as elsewhere in the organization. Your readers should take a look!
The advice I give to everyone is to complete a project and make it public. When I taught data visualization, I forced my students to make their final projects public. It doesn’t really matter what it is: tied to work or not, a visualization or an open-source library, whether it’s polished or not, you need to get it out there.
All of us who work with numbers or code for a living know that most projects include some ugly parts, whether you think of them as kludges or technical debt or hand-waving through assumptions/theory. Just being able to ship something that delivers what it promises puts you ahead of most people who would like to work in data science.
When I got started I did all sorts of dumb fun projects around things I was interested in, just to learn. I made a Mouse Speedometer, a map of the US banking system, a comparison of European languages, a bunch of stuff. None of these are going to win any awards, but they helped me build a portfolio, and generally were a lot of fun to play around with as I was learning new tech.
When I’m hiring, I love seeing what people have done before, either at work or on their own. I think the number one complaint about data-oriented people in industry is that we aren’t always good at shipping products out the door, so knowing that someone actually gets stuff done means a lot to me.