Natural Language Processing in 3 lines of code? — video

Introduction to the Transformers Library for Advanced NLP

[1]: from transformers import pipeline
[2]: nlp = pipeline("sentiment-analysis")
[3]: result = nlp("Lovely atmosphere, staff are super friendly and wonderful people.")[0]
[4]: print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

I asked a fellow from my Bootcamp cohort, “I want to do sentiment analysis on reviews for a friend, do you know of any good leads?” He wrote back, “Look at the Transformers library, you can do an accurate prediction with 3 lines of code.”


And he wasn’t fooling: I made an exploratory journey through the library and what it has to offer, which you can see in the embedded YouTube video below. I was pleasantly surprised by how clear and understandable the documentation and quick-start instructions were!

What I found in my brief exploration

Thanks to the user-friendly documentation and tutorials, I was able to learn easily about the various pre-trained models the library keeps ready to perform advanced NLP. Well-known, state-of-the-art models like GPT-2, BERT, and its variations were all listed.

I had a thorough look through the quick start and learnt much about how these models want the input data to be formatted. In particular, I learnt that the models turn words into numbers, and those numbers are provided to the model as lists or tensors.
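The idea of turning words into numbers can be sketched in a few lines of plain Python. The tiny vocabulary and the `encode` helper below are made up for illustration; a real tokenizer from the library ships with a vocabulary of tens of thousands of entries and smarter rules for splitting text.

```python
# Toy illustration: a tokenizer maps each word to an ID from a vocabulary.
# This vocabulary is invented for the sketch; real models use much larger ones.
vocab = {"[UNK]": 0, "lovely": 1, "atmosphere": 2, "staff": 3, "are": 4, "friendly": 5}

def encode(text):
    """Map each lowercased word to its vocabulary ID, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

ids = encode("Lovely atmosphere staff are friendly")
print(ids)  # [1, 2, 3, 4, 5]
```

The resulting list of IDs is what actually gets handed to a model, usually after being wrapped in a tensor.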

I learnt how the data is preprocessed and how the rules around preprocessing change depending on the model being used. Machine learning models often need separate input sequences to have the same shape, so padding is used to add 0’s to the shorter inputs, and truncation is used to shorten inputs that exceed the maximum length.
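Padding and truncation can be sketched as a single helper. The `pad_or_truncate` function below is a hypothetical illustration of the idea, not the library's own API (the real tokenizers take `padding` and `truncation` arguments that do this for you):

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Force a list of token IDs to exactly max_length entries."""
    if len(ids) >= max_length:
        return ids[:max_length]                       # truncation: drop the excess
    return ids + [pad_id] * (max_length - len(ids))   # padding: fill with 0's

print(pad_or_truncate([101, 7592, 102], 5))                 # [101, 7592, 102, 0, 0]
print(pad_or_truncate([101, 7592, 2088, 999, 102, 3], 5))   # [101, 7592, 2088, 999, 102]
```

After this step, every sequence in a batch has the same length and can be stacked into one tensor.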

I also learnt how the models receive an attention mask telling them where those 0’s are. An attention mask is like a map showing the model where its attention should go: it labels the areas of a tensor that are just padding (the 0’s) so the model can ‘attend’ only to the actual data.
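Building that map is simple enough to sketch directly. The helper below is an illustration of the concept (the library's tokenizers return an `attention_mask` for you): a 1 marks a real token and a 0 marks padding.

```python
def attention_mask(ids, pad_id=0):
    """Return 1 where there is a real token and 0 where there is only padding."""
    return [0 if tok == pad_id else 1 for tok in ids]

padded = [101, 7592, 102, 0, 0]   # a sequence padded out to length 5
print(attention_mask(padded))     # [1, 1, 1, 0, 0]
```

The model then ignores the positions masked with 0 when computing attention.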

and more…

References

  1. Transformers Library — homepage
  2. GPT-2 model, “too dangerous to release publicly” — GitHub, OpenAI blog
  3. BERT model — wiki page
  4. A gentle introduction to Tensors — Machine Learning Mastery

Practicing Data Scientist. Interested in Games, Gamification, Ocean Sciences, Music, Biology.
