DEV Community



Spark NLP: State of the art natural language processing at scale

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. This talk introduces the Spark NLP library, the most widely used NLP library in the enterprise. Its adoption rests on production-grade, trainable, and scalable implementations of state-of-the-art deep learning and transfer learning NLP research, released as a permissively licensed open-source library backed by a highly active community and team.

Spark NLP natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, unified pipelines that leverage all of Spark’s built-in optimizations. Benchmarks and design best practices for building NLP, ML, and DL pipelines will be shared. The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection. The talk will demonstrate using these algorithms to solve common tasks, using Python notebooks that will be made publicly available after the talk.

Bio: David Talby is the chief technology officer at John Snow Labs, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare and life science. Previously, he was with Microsoft, where he led business operations for Bing Shopping in the US and Europe, and before that at Amazon in Seattle and in the UK, where he built and ran distributed teams that helped scale global financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.
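To make the pipeline design mentioned in the abstract more concrete, here is a minimal, library-free sketch of the underlying idea: each stage reads one named column and adds a new one, so stages can be chained into a single unified pipeline. This is a toy illustration only and not Spark NLP's API. The names `Stage`, `Pipeline`, `tokenize`, and `normalize` are hypothetical, and the rows are plain Python dictionaries rather than distributed Spark DataFrames.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# A row is a plain dict of column name -> value (a stand-in for a DataFrame row).
Row = Dict[str, object]


@dataclass
class Stage:
    """Hypothetical pipeline step: reads `input_col`, writes `output_col`."""
    input_col: str
    output_col: str
    fn: Callable[[object], object]

    def transform(self, rows: List[Row]) -> List[Row]:
        # Build new rows with one extra column; earlier columns are preserved.
        return [{**row, self.output_col: self.fn(row[self.input_col])} for row in rows]


class Pipeline:
    """Hypothetical container that applies its stages in order."""

    def __init__(self, stages: List[Stage]):
        self.stages = list(stages)

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


def tokenize(text: str) -> List[str]:
    # Very rough tokenizer: runs of letters and apostrophes.
    return re.findall(r"[A-Za-z']+", text)


def normalize(tokens: List[str]) -> List[str]:
    # Lowercase every token.
    return [t.lower() for t in tokens]


pipeline = Pipeline([
    Stage("text", "token", tokenize),
    Stage("token", "normalized", normalize),
])

rows = pipeline.transform([{"text": "Spark NLP extends Spark ML."}])
print(rows[0]["normalized"])
```

Because every stage only appends a column, later stages (for example a tagger or a sentiment scorer) could consume any earlier column without re-reading the raw text. The real library applies the same composition idea to Spark ML pipelines over distributed data.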

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
