NLP | Storing Frequency Distribution in Redis

The nltk.probability.FreqDist class is used in many classes throughout NLTK for storing and managing frequency distributions. It’s quite useful, but it’s all in-memory, and doesn’t provide a way to persist the data. A single FreqDist is also not accessible to multiple processes. All that can be changed by building a FreqDist on top of Redis.
What is Redis?

  • Redis is a data structure server that is one of the more popular NoSQL databases.
  • Among other things, it provides a network-accessible database for storing dictionaries (also known as hash maps).
  • Building a FreqDist interface to a Redis hash map will allow us to create a persistent FreqDist that is accessible to multiple local and remote processes at the same time.

Installation :

  • Install both Redis and redis-py. The Redis website is at http://redis.io/ and includes many documentation resources.
  • To use hash maps, install the latest version, which at the time of this writing is 2.8.9.
  • The Redis Python driver, redis-py, can be installed using pip install redis or easy_install redis. The latest version at this time is 2.9.1.
  • The redis-py home page is at http://github.com/andymccurdy/redis-py/.
  • Once both are installed and a redis-server process is running, you’re ready to go. Let’s assume redis-server is running on localhost on port 6379 (the default host and port).

How does it work?

  • The FreqDist class extends the standard library collections.Counter class, which makes a FreqDist a small wrapper with a few extra methods, such as N().
  • The N() method returns the number of sample outcomes, which is the sum of all the values in the frequency distribution.
  • An API-compatible class is created on top of Redis by extending a RedisHashMap and then implementing the N() method.
  • The RedisHashFreqDist (defined in redisprob.py) sums all the values in the hash map for the N() method.
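To make the role of N() concrete before moving to Redis, here is a minimal in-memory sketch of the same idea. TinyFreqDist is a hypothetical stand-in written for this illustration; it is not the NLTK class, only a Counter subclass with an N() method as described above.

```python
from collections import Counter


class TinyFreqDist(Counter):
    """Hypothetical stand-in for nltk.probability.FreqDist (illustration only)."""

    def N(self):
        # Total number of sample outcomes: the sum of all the counts.
        return sum(self.values())


fd = TinyFreqDist("abracadabra")  # each character is one sample outcome
print(fd["a"])  # 5
print(fd.N())   # 11
print(fd["z"])  # 0, because Counter returns 0 for samples it has not seen
```

The Redis-backed class keeps this interface and only changes where the counts are stored.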

Code: The RedisHashFreqDist class




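from rediscollections import RedisHashMap

class RedisHashFreqDist(RedisHashMap):
    def N(self):
        # Total number of sample outcomes: the sum of all stored counts
        return int(sum(self.values()))

    def __missing__(self, key):
        # Default count for a sample that has not been seen
        return 0

    def __getitem__(self, key):
        # Redis returns stored values as strings, so convert to int;
        # a missing key falls back to 0
        return int(RedisHashMap.__getitem__(self, key) or 0)

    def values(self):
        # Convert every stored value to int
        return [int(v) for v in RedisHashMap.values(self)]

    def items(self):
        # Return (sample, count) pairs with integer counts
        return [(k, int(v)) for (k, v) in RedisHashMap.items(self)]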


This class can be used just like a FreqDist. To instantiate it, pass a Redis connection and the name of our hash map. The name should be a unique reference to this particular FreqDist so that it doesn’t clash with any other keys in Redis.

Code:




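from redis import Redis
from redisprob import RedisHashFreqDist

r = Redis()  # connects to localhost:6379 by default
rhfd = RedisHashFreqDist(r, 'test')  # 'test' is the Redis key that holds the hash map
print(len(rhfd))

rhfd['foo'] += 1  # reads the current count, then writes back the count plus one
print(rhfd['foo'])

rhfd.items()  # returns the (sample, count) pairs; the result is not printed here
print(len(rhfd))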


Output :

0
1
1
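One caveat when several processes share the same FreqDist: rhfd['foo'] += 1 is a read followed by a write, so two processes that increment the same sample at the same moment can lose an update. Redis provides the HINCRBY command, which redis-py exposes as hincrby(name, key, amount), and which performs the addition on the server in a single step. The sketch below assumes r is a redis.Redis client; increment is a hypothetical helper written for this illustration and is not part of redisprob.py.

```python
def increment(r, name, sample, count=1):
    """Atomically add `count` to the frequency of `sample` in the hash `name`.

    Hypothetical helper: `r` is assumed to be a redis.Redis client
    (or any object with a compatible hincrby method).
    """
    # HINCRBY treats a missing field as 0, adds `count` on the server,
    # and returns the new value.
    return int(r.hincrby(name, sample, count))
```

With a running server, increment(r, 'test', 'foo') would take the place of rhfd['foo'] += 1 and return the updated count.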

Most of the work is done in the RedisHashMap class, which extends collections.MutableMapping (collections.abc.MutableMapping in Python 3.3 and later) and overrides every method that needs a Redis-specific command. Each of these methods maps to one Redis command:

  • __len__(): This uses the hlen command to get the number of elements in the hash map
  • __contains__(): This uses the hexists command to check if an element exists in the hash map
  • __getitem__(): This uses the hget command to get a value from the hash map
  • __setitem__(): This uses the hset command to set a value in the hash map
  • __delitem__(): This uses the hdel command to remove a value from the hash map
  • keys(): This uses the hkeys command to get all the keys in the hash map
  • values(): This uses the hvals command to get all the values in the hash map
  • items(): This uses the hgetall command to get a dictionary containing all the keys and values in the hash map
  • clear(): This uses the delete command to remove the entire hash map from Redis
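
The article does not show rediscollections.py itself, so the following is only a minimal sketch of how the method-to-command mapping listed above could look. It assumes a constructor that takes a redis-py client and the hash name, mirroring the RedisHashFreqDist(r, 'test') call used earlier; the attribute names _r and _name are made up for this illustration.

```python
from collections.abc import MutableMapping


class RedisHashMap(MutableMapping):
    """Dict-like view of a single Redis hash (sketch, not the original module)."""

    def __init__(self, r, name):
        self._r = r        # assumed: a redis.Redis client
        self._name = name  # Redis key under which the hash is stored

    def __len__(self):
        return self._r.hlen(self._name)

    def __contains__(self, key):
        return bool(self._r.hexists(self._name, key))

    def __getitem__(self, key):
        # hget returns None when the field does not exist
        return self._r.hget(self._name, key)

    def __setitem__(self, key, value):
        self._r.hset(self._name, key, value)

    def __delitem__(self, key):
        self._r.hdel(self._name, key)

    def __iter__(self):
        # Required by MutableMapping; iterates over the hash's keys
        return iter(self.keys())

    def keys(self):
        return self._r.hkeys(self._name)

    def values(self):
        return self._r.hvals(self._name)

    def items(self):
        # hgetall returns a dict; expose it as a list of (key, value) pairs
        return list(self._r.hgetall(self._name).items())

    def clear(self):
        # Deleting the Redis key removes the whole hash in one command
        self._r.delete(self._name)
```

A real Redis client hands values back as strings (or bytes), which is why RedisHashFreqDist converts them with int() in __getitem__(), values(), and items().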

