I’m a programmer. There you have it. I also like to write stories. How does one do that? No, I’m not talking about taking a course or reading books about plotting (fun fact: I’m currently reading ‘The Hero with a Thousand Faces’). I’ve already done that. I’m asking: which tools do you use? Google gives you many answers: Scrivener, a simple text editor, Apple Pages, Microsoft Word and mind-mapping tools. There is another kind of tool that ‘most writers’ love: distraction-free writing, concentrating on the story itself and pushing everything else out of the way. It’s a good idea. But do you know any tools for proofreading or working with a text after you have written it?
Programmers have many tools for their work. There are a lot of things to do with code after you’ve finished writing it: leaving comments, debugging, refactoring, syntax checks, grammar checks, code linting, test coverage, duplicate-code detection and many, many more. For writing, so far, you only have human proofreading. A poor showing for the 21st century.
Intellisense – guessing the next word
Have you ever tried to code with vim? It’s exhausting compared to IntelliJ, Eclipse or Netbeans. The main reason for me is that vim (and similar editors) has no code completion. You always have to look in the documentation or other files to find function names or parameters. Intellisense helps by suggesting the most probable function, variable or class you want to type next. That’s what I tried to build first.
Google Instant is a pretty good example of guessing the next word. I don’t know about the inner workings of Google Instant; I can only guess that it works with big data models. It’s Google!
I think I faced these challenges:
- Finding the right model to work with
- Acquiring the data to train the model
- Handling and querying large amounts of data (while processing and later when predicting)
Finding the right model to work with
If you think big data, and all useful combinations of words are finite, then every good word combination must appear at least once in a large dataset (hopefully many times more). So if you gather every n-gram (a “gram” being a word in this context) from the dataset, you can make a good guess about the next word.
In reality this isn’t true. There are many language-specific combinations that are grammatical but not all of them are useful to the user. Think of “a, an, the, that, …”: these words come easily to anyone familiar with a language and aren’t needed in the prediction results.
Another example is synonyms. Suppose the dataset we analyse uses “good” to describe something in one part and “splendid” in another: a lot of combinations and predictions would be lost to the intellisense, because the counts are split between the two words.
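To make the basic idea concrete, here is a minimal sketch of gathering and counting 3-grams from a text. The class and method names are illustrative, not the ones used in my prototypes, and real input would of course be normalized first.

```java
import java.util.HashMap;
import java.util.Map;

// Collect every 3-gram from a text and count how often it occurs.
public class NGramCounter {
    public static Map<String, Integer> trigrams(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 2 < words.length; i++) {
            String key = words[i] + " " + words[i + 1] + " " + words[i + 2];
            counts.merge(key, 1, Integer::sum);  // increment the phrase count
        }
        return counts;
    }
}
```

Given enough text, the counts become the “good guess” about which word follows a two-word prefix.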
Acquiring the data to train the model
We need lots of data to train the model, the more the better. This point is quite easy to solve, because there are a lot of database dumps these days:
- A Wikipedia database dump (the roughly 50 GB dump is what I used later)
- Your own ebook library
Handling and querying large amounts of data
Handling and querying large amounts of data can be difficult. We are talking about 50 gigabytes of data to parse and to train the model on. The model data has to be queried in less than one second to provide a fast user experience. The system requirements of the algorithm should not exceed those of a normal laptop or desktop machine (I don’t own a server farm and I don’t want to rent Amazon cloud machines for this). The problem is ‘so difficult’ that it took me four prototypes to arrive at a possible solution. The whole text-analyser thing is just a pet project of mine, so I used a few concepts without much evaluation (which you should obviously do for a production system).
The first concept is using a key-value type of NoSQL database. You can think of this type of database as a large HashMap on disk (and partially in memory). I decided to use the ‘fastest’ in-process key-value database known to me: LevelDB. An additional benefit is that the keys are sorted and it’s possible to iterate over them easily. This enables searching for phrases.
The next concept is a word database. This means that every word gets assigned a number for further reference. This is a kind of domain-specific compression and should reduce the amount of data handled and processed considerably, while using very little processing power.
Further concepts are highlighted in each prototype section.
Test 01

Looking back after finishing “Test 04”, the “Test 01” prototype looks more and more awful. You can find the code in the test01 package. The first test used the word database as a simple compression scheme with a “forward” and a “backward” map (“word to number” and “number to word” HashMaps), called SurrogateIndexDB.
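A minimal sketch of such a word database, in the spirit of the SurrogateIndexDB just mentioned (the names and details here are illustrative, not the prototype’s actual code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A "forward" (word -> number) and "backward" (number -> word) index.
public class WordIndex {
    private final Map<String, Integer> forward = new HashMap<>();
    private final List<String> backward = new ArrayList<>();

    // Returns the existing index for a word, or assigns the next free one.
    public int toIndex(String word) {
        Integer idx = forward.get(word);
        if (idx != null) return idx;
        int next = backward.size();
        forward.put(word, next);
        backward.add(word);
        return next;
    }

    public String toWord(int index) {
        return backward.get(index);
    }
}
```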
The phrase storage uses two databases, called ‘phrase’ and ‘stat’ in the code. The code extracts all 3-grams from the input file and stores the first two words in the phrase database as a key. The value is an index used to link the phrase to the stat database via a combined key: the stat key is that index (the value from the phrase database) joined with the index of the third word.
While this prototype did what it was supposed to do, it didn’t scale well enough to handle the large amount of data. I used a profiler to find the hot spots, which led me to “Test 02”.
Test 02

LimitedList is the class used to produce the n-grams; internally it uses a simple ArrayList. “Test 01” showed this list was the hot spot, so it got improved in this prototype. LimitedListFast is a simple array implementation that uses an index to track the moving first element. Adding to the list is implemented as a simple replacement of the oldest array slot. I can’t think of a more efficient implementation in Java, so this implementation is used up to at least “Test 04”.
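The idea behind LimitedListFast can be sketched as a fixed-size ring buffer; this is my own minimal reconstruction, not the prototype’s code:

```java
// A fixed-size sliding window backed by a plain array: an index marks
// the logical first element, and adding overwrites the oldest slot.
public class SlidingWindow {
    private final String[] slots;
    private int first = 0;   // index of the logical first element
    private int size = 0;

    public SlidingWindow(int capacity) {
        slots = new String[capacity];
    }

    public void add(String word) {
        if (size < slots.length) {
            slots[(first + size) % slots.length] = word;
            size++;
        } else {
            slots[first] = word;                    // overwrite the oldest entry
            first = (first + 1) % slots.length;     // move the logical start
        }
    }

    // The current window in logical order, e.g. one n-gram.
    public String[] window() {
        String[] out = new String[size];
        for (int i = 0; i < size; i++) out[i] = slots[(first + i) % slots.length];
        return out;
    }
}
```

No elements are ever shifted or reallocated, which is what made it so much faster than the ArrayList version.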
To further reduce the amount of storage needed, the encoding of phrases was improved with multiple ideas. I like to call this concept “Knowledge Key Coding”: the key is structured in such a way that simple hits and misses tell you what you want to know. In the case of LevelDB you can also take advantage of the sorted keys to further improve the performance of the system.
I encoded the key in the phrase database like this: [Word1, Word2, Word3] = [occurrences of this phrase]. This removed the need for the “stat” database. It makes querying the database more difficult, but we can use the sorted-key feature of LevelDB to work around this issue. The code seeks to the first [Word1, Word2, XXX] entry and then continues until a key with a different Word2 is returned. Simple and effective.
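The seek-and-scan trick can be illustrated with a TreeMap standing in for LevelDB’s sorted keys. Keys here are plain strings for readability; the prototype uses packed word indexes instead.

```java
import java.util.Map;
import java.util.TreeMap;

// Seek to the first key with a given prefix and iterate until the
// prefix changes, mimicking a LevelDB iterator over sorted keys.
public class PrefixScan {
    public static Map<String, Integer> scan(TreeMap<String, Integer> db, String prefix) {
        Map<String, Integer> hits = new TreeMap<>();
        // tailMap() "seeks" to the first key >= prefix.
        for (Map.Entry<String, Integer> e : db.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;  // prefix changed: stop
            hits.put(e.getKey(), e.getValue());
        }
        return hits;
    }
}
```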
Furthermore, I introduced a code at the beginning of the key to store the index of a word more efficiently. It’s called “packing” and “unpacking” in the code and removes all leading zeros when the index is converted to a byte array.
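A sketch of that packing idea: drop the leading zero bytes of an int. The exact byte layout in the prototype may differ.

```java
// Store a word index without its leading zero bytes.
public class Pack {
    public static byte[] pack(int value) {
        // Number of bytes actually needed (at least one, even for zero).
        int bytes = Math.max(1, (32 - Integer.numberOfLeadingZeros(value) + 7) / 8);
        byte[] out = new byte[bytes];
        for (int i = 0; i < bytes; i++) {
            out[bytes - 1 - i] = (byte) (value >>> (8 * i));  // big-endian order
        }
        return out;
    }

    public static int unpack(byte[] packed) {
        int value = 0;
        for (byte b : packed) value = (value << 8) | (b & 0xFF);
        return value;
    }
}
```

Since small word indexes are by far the most frequent, most keys shrink from four index bytes to one or two.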
If you want to run this test, use the class MCP (Master Control Program) as the entry point. Minor improvements were made to use LevelDB’s “WriteBatch” capability to store data more efficiently.
When running this test, the speed of storing and retrieving data from the phrase database dropped greatly once the database reached a few gigabytes. After 6 gigabytes and a few days I stopped the test, as it hadn’t processed more than 10% of the 50 GB Wikipedia database dump.
I used the resulting database to write a Sublime Text plugin and tried to use the predictions in a real-world application. Unfortunately, a lot of garbage I hadn’t noticed before was stored in the phrases. The way I was parsing the input data resulted in predictions like these:
----,- ----,-,-,-, ----,-,kohn, ----,.weiss, ----,cobiss.sr-id,pp.nbsp ----,httpwww.amazon.comdp ----,isbn ----,p. ----,p.. ----,page ----,pp.nbsp
So I needed a way to exclude that data, which leads us to “Test 03”.
Test 03

This prototype never got fully implemented; it only contains a WordNormalizer, which tries to reduce the input stream to words. Anything I considered not useful for phrase building is then stripped from these words.
In the meantime I have read a book about natural language processing, so now I know that I programmed the worst stemmer and stop-word remover of all time.
Test 04

The latest prototype combines the ideas from the other prototypes into one. With the improvements made, I increased the n-gram length from three to five. The entry point for this prototype is the class MCP4.
The power of multiple cores. I wrote a simple pipeline system (“Warehouses”) in which multiple cores are used for normalization and storage of the generated phrases. This lets the analyser use more cores for word normalization than for phrase building and balance the processing power between the different parts of the pipeline. It made it possible to keep an eight-core system at around 100% CPU for almost the whole processing duration (spoiler: around one day for 50 GB of input data).
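A heavily simplified sketch of such a pipeline, with one normalization stage and one consumer stage connected by a bounded queue; the real system uses more stages and several threads per stage, and the names here are mine:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Two pipeline stages on separate threads, handing words over a queue.
public class Pipeline {
    private static final String POISON = "\u0000end";  // shutdown marker

    public static int countNormalized(String[] words) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(64);
        AtomicInteger stored = new AtomicInteger();

        Thread consumer = new Thread(() -> {
            try {
                String w;
                while (!(w = queue.take()).equals(POISON)) {
                    stored.incrementAndGet();      // stand-in for phrase building
                }
            } catch (InterruptedException ignored) { }
        });
        consumer.start();

        try {
            for (String w : words) {
                queue.put(w.toLowerCase());        // stand-in for normalization
            }
            queue.put(POISON);
            consumer.join();
        } catch (InterruptedException ignored) { }
        return stored.get();
    }
}
```

The bounded queue is the balancing mechanism: a slow stage makes the queue fill up and throttles the fast one.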
Unique phrases. The word indexes (see “Test 01”, word database) used to encode the keys are sorted in this prototype, so that simple swaps in phrases are stored under the same database key. This makes querying the database more difficult, but we can use the sorted-key feature of LevelDB to work around the issue. The code seeks to the first [Word1, Word2, XXX] entry and then continues until a key with a different Word2 is returned. It also has to search every other combination (called a ‘variation’ in the code), because the word searched for could be at a different position: the variations [XXX, Word1, Word2] and [Word1, XXX, Word2] are also queried.
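The key canonicalization itself is tiny; this sketch (illustrative names, not the prototype’s code) shows how two swapped phrases end up with the same key:

```java
import java.util.Arrays;

// Sort the word indexes before building the key, so that simple word
// swaps in a phrase map to the same database key.
public class CanonicalKey {
    public static int[] canonical(int[] wordIndexes) {
        int[] key = wordIndexes.clone();
        Arrays.sort(key);
        return key;
    }
}
```

The price is that the stored key no longer preserves word order, which is exactly why the variation queries at retrieval time become necessary.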
Split databases. LevelDB provides fast key storage and retrieval, but misses seem to be quite expensive and merging data (loading it from disk and updating it) is not that fast. In this prototype I kept around five million phrases in an in-memory HashMap; when more phrases are found, the five million are flushed to disk and removed from memory, and a new database is created for the next five million. The disadvantage is that the same phrase may be stored in multiple databases. A post-processing step is required to remove this penalty:
Sorting and grouping of the partial phrase databases. The databases resulting from phrase generation may store the same phrase multiple times. To overcome that limitation, a post-processing step (called “Extractor01” in the code) merges all the split databases and produces distinct phrases in a new set of databases. The resulting databases are grouped by phrase occurrence count and split into small single databases of around ten million phrases each. The code uses the sorted-key ability of LevelDB to perform a kind of sorted merge of all databases at the same time.
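The core of that merge step can be sketched like this, with TreeMaps standing in for the partial LevelDB databases. A real implementation would stream the sorted keys side by side instead of holding everything in memory.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Merge several sorted partial databases into one, summing the counts
// of phrases that appear in more than one partition.
public class SortedMerge {
    public static TreeMap<String, Integer> merge(List<TreeMap<String, Integer>> parts) {
        TreeMap<String, Integer> merged = new TreeMap<>();
        for (TreeMap<String, Integer> part : parts) {
            for (Map.Entry<String, Integer> e : part.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return merged;
    }
}
```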
This step took around six hours and reduced the data storage required from 26 GB to around 22 GB.
Performance of the last prototype
Unfortunately the performance isn’t as good as I initially expected. Retrieval has to seek into many databases, and each database has to be searched for many variations. The databases are grouped in the file system by phrase occurrence count, so it’s easy to remove, for example, all databases for phrases that occur only once.
For the following tests, a wider set of phrase combinations is searched, so the search time is around 10 seconds for six combinations with at most two missing words (denoted X in the following examples). The combinations are produced from the last seven words the user typed, like so: [Word1, Word2, Word3, Word4, X], [Word1, Word2, Word3, X, X], [Word2, Word3, Word4, Word5, X], … Each combination needs a lot of variations to be fully queried in the database.
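The combination generation described above can be sketched as follows; the exact rules in the prototype may differ, and the names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Slide a window over the last typed words and replace the tail of each
// window with up to maxMissing placeholders ("X").
public class Combinations {
    public static List<String> build(String[] lastWords, int windowSize, int maxMissing) {
        List<String> combos = new ArrayList<>();
        for (int start = 0; start + windowSize <= lastWords.length; start++) {
            for (int missing = 1; missing <= maxMissing; missing++) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < windowSize; i++) {
                    if (i > 0) sb.append(' ');
                    if (i >= windowSize - missing) sb.append('X');   // placeholder slot
                    else sb.append(lastWords[start + i]);
                }
                combos.add(sb.toString());
            }
        }
        return combos;
    }
}
```

Every combination then still has to be multiplied by its variations (see “Unique phrases” above), which is where the query time goes.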
Please note that “Test 04” has only limited stop-word removal, and the input is shown after that process: you have to imagine the “a”, “the”, … yourself. The words in the phrases are also sorted by word index, so they are not in a meaningful order.
- Input: what, wrong, hint, above, read, about, word, stemmers, really
- list wikipediaadministrators, 1
- that pallywood, 1
- article wpdicdef, 1
- into written, 1
- book tibet, 1
- article whips, 1
- more does, 1
- book reference, 1
- Processing time: 1s
- Input: together, took, around, reduc, data, storage, requir
- multiple store, 1
- units capacity, 1
- units however, 1
- just megabyte, 1
- term long, 1
- that tempdb, 1
- with archival, 1
- system such, 1
- Processing time: 0.5s
Not what I expected, but considering that I basically started without any knowledge of the subject or of how to process that amount of data, “it’s (quite) something”.
I recently read a book about natural language processing and noticed what I did wrong. As hinted above, I read about word stemmers, and I really should use them. Many of the results predicted by the fourth prototype are basically stop words or MediaWiki syntax. I have to read more about natural language processing. I do want to use a library for that and I’m currently searching for one: there is no time for me to implement my own algorithms, and there are probably many powerful ones out there already.
The next things to include (or think about) would be:
- Usage of stemming
- Complete removal of stop words
- Usage of Part of Speech tagging
- Usage of professional sentence detection
- Improve predictions by analysing the sentence structure
- Advanced normalization with synonyms
- Use a professional Wikipedia syntax remover (if possible)
- Use longer n-grams for more useful predictions?
- Use a different editor, since the amount of information that you can provide the user via a sublime text completion plugin is limited.