Objective: This repository captures pointers and guidance on NLP (Natural Language Processing) fundamentals from a learning perspective, along with innovation / research areas. It also highlights recommended subject areas and content for accelerating your learning journey in this field.
Target Audience: Data Science and AI practitioners who already have a fundamental, working knowledge of Machine Learning concepts and a Python/R/SQL programming background.
- Research Focus and Trends
- Intro and Learning Content
- Techniques
- Libraries / Packages
- Services
- Datasets
- Video and Online Content References
- Course References
- Updates in 2022-2023:
- Generative AI from Text-to-Image generation standpoint:
- Hierarchical Text-Conditional Image Generation with CLIP Latents DALL-E 2
- High-Resolution Image Synthesis with Latent Diffusion Models Stable Diffusion
- LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models - CLIP is used
- Please keep referring to NLP-related research papers from AAAI, NeurIPS, ACL, ICLR, and similar conferences for the latest research focus areas. Most of these are also available on arXiv.org.
- A few key recent research papers worth reading are listed below (note that this list changes over time and may not be current):
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale - the GitHub page
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer - the GitHub page with pretrained models along with the dataset and code
- Reformer: The Efficient Transformer - the GitHub page with official code implementation from Google and the GitHub page with PyTorch implementation of Reformer
- Longformer: The Long-Document Transformer - the GitHub page
- NLP-Progress tracks the progress in Natural Language Processing, including the datasets and the current state-of-the-art for the most common NLP tasks.
- NLP-Overview is an up-to-date overview of deep learning techniques applied to NLP, including theory, implementations, applications, and state-of-the-art results. This is a great Deep NLP Introduction for researchers.
- Detect Radiology related entities with Spark NLP
- NLP's ImageNet moment
- ACL 2018 Highlights: Understanding Representations and Evaluation in More Challenging Settings
- Four deep learning trends from ACL 2017 - Part 1 - Linguistic Structure and Word Embeddings
- Four deep learning trends from ACL 2017 - Part 2 - Interpretability and Attention
- Deep Learning for NLP: Advancements & Trends
- Deep Learning for NLP: without Magic
- Stanford NLP
- BERT, ELMo and GPT-2: How Contextual are Contextualized Word Representations? - from Stanford AI Lab
- The Illustrated BERT, ELMo and others - NLP and transfer learning in context
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, Related Code
- A Mutual Information Maximization Perspective of Language Representation Learning
- DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling
- Lexical Semantics, Semantic Processing
- POS Tagging
- Discourse
- Paraphrasing / Entailment / Generation
- Machine Translation
- Information Retrieval
- Text Mining
- Information Extraction
- Question Answering
- Dialog Systems
- Spoken Language Processing
- Speech Recognition & Synthesis
- Computational Linguistics and NLP
- Chunking / Shallow Parsing
- Parsing / Grammatical Formalisms etc.
Area | Description | Target Timeline
---|---|---
Pre-Requisites | | Week 0
Handling Text Processing | | Week 1-4
Language Modeling & Sentiment Classification with DL, Translation with RNNs | | Week 5-8
Reading and handling Text from Images | | Week 9-12
- Text Embeddings
- Word Embeddings
- Thumb Rule: fastText >> GloVe > word2vec (a quick usage sketch follows the Text Embeddings list)
- Implementation from Facebook Research - fastText
- GloVe: Global Vectors for Word Representation - Explainer Blog
- word2vec - Implementation - Explainer Blog
- Sentence and Language Model Based Word Embeddings
- ELMo: Embeddings from Language Models - Basics, Deep contextualized word representations
- PyTorch Implementation from AllenAI/AllenNLP
- TF Implementation from AllenAI
- ULMFiT : Universal Language Model Fine-tuning for Text Classification by Jeremy Howard and Sebastian Ruder - Paper Ref
- InferSent - Supervised Learning of Universal Sentence Representations from Natural Language Inference Data by Facebook - Paper Ref
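To make the thumb rule above concrete, here is a minimal sketch of loading pretrained word vectors with gensim. The downloader model name `glove-wiki-gigaword-50` is an assumption of this example (one of gensim's hosted datasets), not tied to any reference above.

```python
# A minimal sketch, assuming gensim and its downloader API are available.
# "glove-wiki-gigaword-50" is one of gensim's hosted pretrained models.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # downloads ~66 MB on first use

# Nearest neighbours in the embedding space
print(glove.most_similar("language", topn=3))

# The classic analogy test: king - man + woman ~= queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```

For fastText's subword-aware vectors, which can also embed out-of-vocabulary words, `gensim.models.FastText` offers a similar interface.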
- Question Answering and Knowledge Extraction (a minimal QA sketch follows this list)
- DrQA - Open Domain Question Answering work by Facebook Research on Wikipedia data
- Document-QA - Simple and Effective Multi-Paragraph Reading Comprehension by AllenAI
- Privee - An Architecture for Automatically Analyzing Web Privacy Policies
- Template-Based Information Extraction without the Templates
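The linked systems (DrQA, Document-QA) have their own setups; as a lightweight stand-in, here is a minimal extractive-QA sketch using the Hugging Face `transformers` pipeline, which is an assumption of this example rather than the method of any paper above.

```python
# A minimal extractive question-answering sketch using Hugging Face
# transformers (not DrQA or Document-QA; just the generic pipeline API).
from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default SQuAD-tuned model

result = qa(
    question="Who developed Document-QA?",
    context="Document-QA is a multi-paragraph reading comprehension system "
            "developed by AllenAI.",
)
print(result["answer"], round(result["score"], 3))
```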
- R NLP Libraries
- text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R
- wordVectors - An R package for creating and exploring word2vec and other word embedding models
- RMallet - R package to interface with the Java machine learning tool MALLET
- dfr-browser - Creates d3 visualizations for browsing topic models of text in a web browser.
- dfrtopics - R package for exploring topic models of text.
- sentiment_classifier - Sentiment Classification using Word Sense Disambiguation and WordNet Reader
- Python NLP Libraries
- NLTK - Natural Language ToolKit
- TextBlob - Simplified text processing, providing a consistent API for diving into common natural language processing (NLP) tasks. It stands on the giant shoulders of the Natural Language Toolkit (NLTK) and Pattern, and plays nicely with both
- spaCy - Industrial-strength NLP with Python and Cython (a short usage sketch follows this list)
- gensim - Python library to conduct unsupervised semantic modelling from plain text
- scattertext - Python library to produce d3 visualizations of how language differs between corpora
- GluonNLP - A deep learning toolkit for NLP, built on MXNet/Gluon, for research prototyping and industrial deployment of state-of-the-art models on a wide range of NLP tasks.
- AllenNLP - An NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.
- PyTorch-NLP - NLP research toolkit designed to support rapid prototyping with better data loaders, word vector loaders, neural network layer representations, common NLP metrics such as BLEU
- Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)
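As a quick taste of two libraries from the list, here is a minimal sketch; the spaCy model name `en_core_web_sm` is an assumption and must be downloaded separately.

```python
# A minimal sketch contrasting two libraries from the list above.
# Assumes the small English spaCy model is installed:
#   python -m spacy download en_core_web_sm
import spacy
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google was founded in California in 1998.")
print([(ent.text, ent.label_) for ent in doc.ents])   # named entities
print([(tok.text, tok.pos_) for tok in doc])          # part-of-speech tags

blob = TextBlob("spaCy makes NLP surprisingly pleasant.")
print(blob.sentiment.polarity)  # polarity score in [-1.0, 1.0]
```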
- Amazon Comprehend - NLP and ML suite covering the most common tasks like NER (Named Entity Recognition), tagging, and sentiment analysis (a minimal usage sketch follows this list)
- Google Cloud Natural Language API - Syntax analysis, NER, sentiment analysis, and content tagging in at least 9 languages, including English
- Microsoft Cognitive Service: Text Analytics
- IBM Watson's Natural Language Understanding - API and GitHub demo
- Cloudmersive - Unified and free NLP APIs that perform actions such as part-of-speech tagging, text rephrasing, language translation/detection, and sentence parsing
- ParallelDots - High level Text Analysis API Service ranging from Sentiment Analysis to Intent Analysis
- Wit.ai - Natural Language Interface for apps and devices
- Rosette - An adaptable platform for text analytics and discovery
- TextRazor - Extract meaning from your text
- Textalytic - Natural Language Processing in the Browser with sentiment analysis, named entity extraction, POS tagging, word frequencies, topic modeling, word clouds, and more
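These services are all called over HTTP or an SDK; as one hedged example, here is a minimal sketch of calling Amazon Comprehend via boto3. The region name and input texts are illustrative, and AWS credentials are assumed to be configured.

```python
# A minimal sketch of calling a managed NLP service, here Amazon Comprehend
# via boto3. Assumes AWS credentials are configured; the region is illustrative.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

sentiment = comprehend.detect_sentiment(
    Text="The new release is impressively fast.",
    LanguageCode="en",
)
print(sentiment["Sentiment"], sentiment["SentimentScore"])

entities = comprehend.detect_entities(
    Text="Amazon was founded by Jeff Bezos in Seattle.",
    LanguageCode="en",
)
print([(e["Text"], e["Type"]) for e in entities["Entities"]])
```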
- Stopwords: Stop words are words which occur frequently in a text corpus, e.g. a, an, the, in. Frequently occurring words are removed from the corpus for the purpose of text normalization. We can import them via `from nltk.corpus import stopwords` to leverage this facility.
- Stemming: Stemming is the reduction of inflection from words. Words with the same origin get reduced to a form which may or may not itself be a word. NLTK has different stemmers which implement different methodologies. A sketch combining both steps follows.
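Here is a minimal sketch of the two normalization steps just described, using NLTK's English stopword list and the Porter stemmer; the sample sentence is illustrative.

```python
# A minimal sketch of stopword removal and stemming with NLTK.
import nltk
nltk.download("stopwords")  # one-time corpus downloads
nltk.download("punkt")      # recent NLTK versions may also need "punkt_tab"

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

text = "The runners were running quickly through the parks"
tokens = word_tokenize(text.lower())

# Drop frequently occurring words that carry little content
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop_words]

# Reduce inflected forms to a common stem
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in filtered])
# ['runner', 'run', 'quickli', 'park'] -- stems need not be real words
```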
- NLP-datasets - Great collection of NLP datasets for use
- gensim-datasets - Data repository for pretrained NLP models and NLP corpora
- Stanford Deep Learning for Natural Language Processing (cs224-n) - Richard Socher and Christopher Manning's Stanford Course
- Deep Natural Language Processing - Lectures series from Oxford
- Neural Networks for NLP - from Carnegie Mellon University's Language Technologies Institute
- fast.ai Code-First Intro to Natural Language Processing - This covers a blend of traditional NLP topics (including regex, SVD, naive Bayes, tokenization) and recent neural network approaches (including RNNs, seq2seq, GRUs, and the Transformer), as well as addressing urgent ethical issues, such as bias and disinformation. Find the Jupyter Notebooks here
- Deep NLP Course by Yandex Data School, covering important ideas from text embedding to machine translation including sequence modeling, language models and so on
- Machine Learning University - Accelerated Natural Language Processing - Lectures go from introduction to NLP and text processing to Recurrent Neural Networks and Transformers. Material can be found here
- Knowledge Graphs in Natural Language Processing @ ACL 2020
- 330+ practical notebooks leveraging NLP techniques
- NLP at Scale - MLOps aspects for customer success
- MLM (Masked Language Modeling) with BERT (a fill-mask sketch follows this list)
- Stanford course CS224n for NLP using DL
- NLP using Huggingface
- Georgia Tech course ref
- NLP in Tensorflow from Coursera
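For the MLM item above, here is a minimal masked language modeling sketch via the Hugging Face fill-mask pipeline; `bert-base-uncased` is the standard public checkpoint, assumed to be downloadable.

```python
# A minimal masked language modeling (MLM) sketch with BERT, using the
# Hugging Face fill-mask pipeline. BERT's mask token is [MASK].
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Print the top three predictions for the masked token
for pred in fill("Natural language processing is a [MASK] field.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```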
End of Contents
Disclaimer: Information represented here is based on my own experiences, learnings, and readings; it in no way represents any firm's or individual's opinion or strategy, and is not intended for anything other than learning and/or research/innovation in the field. Content here and on this repository is non-exhaustive, and a continuous-improvement / continuous-learning focus is needed to learn more. Recommendation: keep learning and keep improving.