What is TensorFrames? TensorFlow + Apache Spark

Adi Polak on March 25, 2019

First things first: what is TensorFrames? TensorFrames is an open source project created by Apache Spark contributors. Its functions and parameters are ...

Gavin Fernandes • Mar 25 '19

Doesn't TensorFlow also have the tf.distribute module for horizontal scaling, though? What advantages does TensorFrames have compared with tf.distribute?

Adi Polak • Mar 25 '19

Hi Gavin, thank you for your comment. Is this the one you mean: databricks.com/tensorflow/distribu... ? From their docs, it seems the graph computation itself is distributed, meaning each machine calculates only part of the graph. In TensorFrames, what is distributed is the data: every relevant row of the distributed data goes through the transformation graph. The graph itself is not split up; it is sent as one piece to the Apache Spark workers. Each worker receives a chunk of the data, works on it, and returns an output, which is later translated back into a Spark DataFrame.

The Apache Spark advantage is that, as long as the data fits in memory, it does all the calculations in memory without writing to disk, which is expensive in time.

In the tf.distribute docs they give the example of ensemble learning, where they send individual machine learning models to multiple workers. They are not working on distributed data; it is more like distributed tasks, which makes it very interesting. Does that sound right?
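To make the "distribute the data, not the graph" idea concrete, here is a minimal plain-Python sketch. It is only an analogy, not the TensorFrames API: the names `graph`, `split`, and `map_blocks` are made up for illustration, and a thread pool stands in for the Spark workers. The same whole function is handed to every worker, and each worker only sees its own chunk of rows.

```python
from concurrent.futures import ThreadPoolExecutor


def graph(block):
    """Stand-in for the transformation graph: add 3 to every value in a block."""
    return [x + 3 for x in block]


def split(rows, n_parts):
    """Split rows into up to n_parts contiguous chunks, like DataFrame partitions."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def map_blocks(fn, rows, n_parts=3):
    """Send the same function to every worker; each worker gets one chunk of data."""
    partitions = split(rows, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        # pool.map keeps the partitions in their original order
        results = list(pool.map(fn, partitions))
    # Stitch the per-partition outputs back into one result
    return [y for block in results for y in block]


result = map_blocks(graph, list(range(10)))
```

The function is never cut into pieces here; only the list of rows is. That is the part of the picture that differs from splitting a graph across machines.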

Gavin Fernandes • Mar 26 '19

Yeah that makes sense. At first I thought tf.distribute.MirroredStrategy works with clusters on separate machines as well, but it looks like that's only for devices on the same machine, and that we only have parallel execution of sections of graphs.

That being said, you would think that they'd make data-level parallelism with tf.keras easier, wouldn't you?

Adi Polak • Mar 26 '19

I would.
It seems that, at the moment, tf.keras is an implementation of the Keras API on top of TensorFlow.
But wait! We can develop in Keras without TensorFlow: Keras is an independent library for deep learning. There is an interesting project that runs Keras on top of Apache Spark, named Elephas: Distributed Deep Learning with Keras & Spark.

Overall, from discussions and online forums, many data scientists say that Keras is better for deep learning, since TensorFlow can be a bit complicated to start with.

Gavin Fernandes • Mar 26 '19

Yeah I know keras is an independent library as well, and yeah it is simpler, but I started machine learning with the low level tensorflow API and only then learnt keras. I do use just keras where I can though.

Currently I'm working on a project that requires the sort of fine control over the training process that only tensorflow can give me, although I haven't tried theano or the rest yet, and it would be infeasible to move to another library with the time constraints we have.

Adi Polak • Mar 27 '19

Yeah, project and time constraints are super important. How do you find TensorFlow? From your perspective, how can one become proficient in it?

Gavin Fernandes • Mar 27 '19 • Edited

I like tensorflow and all, but I can't say it's without its flaws. It feels like parts of the library are duplicated elsewhere within it, and some sections lack succinct documentation.

I was working with TFRecord a few weeks ago, and the long and short of it is there were two different ways of writing a TFRecord, and both gave you different output files, which were both valid TFRecords. Plus TFRecords aren't simple feature-label <rant> ... </rant>.
Jeez, I stuck to pandas after that.

I think tensorflow is going in the right direction though. They're working to bring keras and estimators closer together with tf 2.0, and in all fairness to them, some of the bumpy edges that I encountered were sections still in development.

Now my perspective is probably not representative of the wider community here on dev.to. For one thing, I don't do JS/WebDev, and stick to C/C++ and python(3), dabbling in Dart and Clojure a bit. For another, my aim isn't to be a data scientist / coder, and I am by no means proficient in tensorflow. With that said, I feel like the best way to get better with tf is to use it more, whether that be in personal projects or by contributing to someone else's. If you really want to push yourself, and have the time to spare, you could try reimplementing bits of tensorflow, for example the convolutional layer, or the tanh activation, or maybe even an optimizer. When you're done, you can compare your version against what the tensorflow source code does as a benchmark.
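As a rough idea of what that exercise could look like, here is a small sketch of the tanh case in plain Python. The names `my_tanh` and `my_tanh_grad` are made up for this example, and `math.tanh` is used as the reference only so the snippet runs without tensorflow installed; with tensorflow available, the comparison would presumably be against `tf.math.tanh` instead.

```python
import math


def my_tanh(x):
    """tanh from its definition, rearranged so the exponent is never positive."""
    # tanh(|x|) = (1 - e^(-2|x|)) / (1 + e^(-2|x|)); with e1 = e^(-2|x|) - 1
    # this becomes -e1 / (e1 + 2). expm1 keeps precision for small |x|.
    e1 = math.expm1(-2.0 * abs(x))
    y = -e1 / (e1 + 2.0)
    # tanh is odd, so restore the sign of the input
    return y if x >= 0 else -y


def my_tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, the quantity backprop needs."""
    return 1.0 - my_tanh(x) ** 2


# Compare against the reference implementation on a few inputs,
# including large magnitudes where a naive formula would overflow.
for value in (-1000.0, -2.0, 0.0, 0.5, 1000.0):
    assert math.isclose(my_tanh(value), math.tanh(value), rel_tol=1e-9, abs_tol=1e-12)
```

Writing the naive `(e^x - e^-x) / (e^x + e^-x)` first and watching it overflow for large inputs is part of the learning; the rearranged form above is one common way around that.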