NLP Pipeline Trained on Modern Art Descriptions
This ☝🏽 got my wheels turning…
Could I design an algorithm that writes convincing modern art installation descriptions? Being a total noob with NLP, I wondered how far I could get in a reasonably short amount of time. Though I’m still wrapping my head around the software tools, I think it can be done.
My first step has been building up a corpus (just a body of domain-specific text) from 277 modern art exhibits at 4 New York City art galleries –– David Zwirner, Gagosian, Gladstone, and Hauser & Wirth. Currently, that corpus consists of 83,000+ words, and it gets bigger each day.
My next step is to use Faker to subclass my own provider with the intent of generating plausible-sounding verb phrases. I will build these from the nouns, adjectives, and verbs pulled from the modern art corpus.
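Here’s a rough sketch of what that provider might look like. The word lists and the phrase template are made-up placeholders standing in for the vocabulary actually pulled from the corpus:

```python
from faker import Faker
from faker.providers import BaseProvider

# Placeholder word lists -- in practice these would come from the corpus.
ART_VERBS = ["interrogates", "subverts", "reframes"]
ART_ADJECTIVES = ["ephemeral", "liminal", "post-industrial"]
ART_NOUNS = ["surface", "gesture", "materiality"]

class ArtSpeakProvider(BaseProvider):
    """Custom Faker provider that assembles plausible-sounding verb phrases."""

    def verb_phrase(self):
        # e.g. "interrogates the liminal surface"
        return "{} the {} {}".format(
            self.random_element(ART_VERBS),
            self.random_element(ART_ADJECTIVES),
            self.random_element(ART_NOUNS),
        )

fake = Faker()
fake.add_provider(ArtSpeakProvider)
print(fake.verb_phrase())
```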
Concurrently, I’ll be looking at learning a LOT more about lemmatizing, tagging, and vectorizing –– just to further wrap my head around some very fascinating ideas and tools.
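For anyone following along, here’s roughly what tagging and lemmatizing look like in nltk. The sample sentence is invented, and the data packages need to be downloaded first:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required nltk data packages.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

sentence = "The installation interrogates shifting surfaces of memory."
tokens = nltk.word_tokenize(sentence)

# Part-of-speech tagging: [('The', 'DT'), ('installation', 'NN'), ...]
print(nltk.pos_tag(tokens))

# Lemmatizing collapses inflected forms, e.g. "surfaces" -> "surface".
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t.lower()) for t in tokens])
```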
Word Frequency Distributions within the Corpora
The following distributions were found using nltk. They zero in on a corpus of 138,641 words, filtered down to 83,673 tokens (with all stopwords like the, a, an, etc. removed). The quality of the corpus can be improved further by using a stemmer and/or lemmatizer – which I intend to do.
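Here’s a sketch of how those counts can be produced; the corpus file name is hypothetical, standing in for the scraped gallery descriptions:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download("stopwords")

# Hypothetical file holding the scraped gallery descriptions.
raw_text = open("modern_art_corpus.txt").read()

tokens = nltk.word_tokenize(raw_text.lower())
stop_words = set(stopwords.words("english"))

# Keep alphabetic tokens that aren't stopwords, then count them.
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
fdist = nltk.FreqDist(filtered)
print(fdist.most_common(25))
```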
Terminology: AI, NLP, and ML
AI (artificial intelligence), NLP (natural language processing), and ML (machine learning) intersect, but are not interchangeable. Practical applications of AI are not new. Think of Charles Babbage’s difference engine in the 1820s, player pianos in the 1890s, and Turing machines in the 1930s.
AI (artificial intelligence)
AI is just automated decision-making – brute-force logic that produces absolutely deterministic results. I like Sam DeBrule’s definition below:
NLP (natural language processing)
NLP focuses on written language – stuff like spam filtering, sentiment analysis, and automated chatbots. SAS is a company that does this kind of work and gives a pretty useful definition of it.
ML (Machine Learning)
This is what the average person typically has in mind when they’re worried that terminators are being incubated at Skynet. Again, I like Sam DeBrule’s excellent definition below:
Specific Tools I Used
I’ve only begun scratching the surface of the first 3 items on the following list, but I’ve been using Faker since 2017.
NLTK is an open-source outgrowth of an academic software project developed by Steven Bird and Edward Loper at UPenn. It provides easy-to-use interfaces for over 50 corpora and lexical resources, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. There’s a huge body of contributors, and broad consensus that this library is the NLP gold standard.
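As a small taste of that suite, here’s NLTK’s Porter stemmer run over a few made-up examples:

```python
from nltk.stem import PorterStemmer

# Stemming chops words down to a common root form.
stemmer = PorterStemmer()
for word in ["installations", "painterly", "exhibited", "abstraction"]:
    print(word, "->", stemmer.stem(word))
```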
Gensim is a topic-modeling library that’s a bit more polished than NLTK (from what I can see in the getting-started docs). The first difference I noticed was that Gensim’s stopwords are a lot more sensible than NLTK’s – without having to fuss around with additional regular expressions. I can’t go into more detail because I’m still learning about both packages.
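For example, gensim ships its own STOPWORDS set and a remove_stopwords() helper; the sample sentence below is made up:

```python
from gensim.parsing.preprocessing import STOPWORDS, remove_stopwords
from nltk.corpus import stopwords

# Compare the two built-in stopword lists.
print(len(STOPWORDS))                          # gensim's frozenset
print(len(set(stopwords.words("english"))))    # nltk's list

# One-liner stopword removal, no extra regular expressions needed.
print(remove_stopwords("the artist interrogates the nature of the readymade"))
```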
WordNet® is a huge lexical database of English nouns, verbs, adjectives, and adverbs. It’s bundled with nltk and allows you to group tokens (words) into sets of cognitive synonyms (synsets), each expressing a distinct concept. Put simply, you can input a word… lousy… and get a crap load of synonyms for that word, plus definitions, plus usages of those words in the context of real English sentences!
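Here’s roughly what that lookup looks like (it requires the wordnet data package to be downloaded first):

```python
from nltk.corpus import wordnet as wn

# Each synset is one distinct sense of the word.
for synset in wn.synsets("lousy"):
    print(synset.name(), "-", synset.definition())
    print("  synonyms:", synset.lemma_names())
    print("  examples:", synset.examples())
```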
Faker is a Python package that generates realistic-looking test data for anyone interested in software testing. It’s like Lorem Ipsum on steroids. I am using it to subclass my own provider as a way to generate realistic-looking verb phrases.
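Off the shelf, it looks something like this; the custom art-speak provider sketched earlier just layers on top of these built-ins:

```python
from faker import Faker

fake = Faker()
print(fake.name())       # realistic-looking person name
print(fake.company())    # realistic-looking company name
print(fake.paragraph())  # lorem-ipsum-style filler prose
```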