Tuesday 17 June 2014

NoSQL for Conlangers

In his blog, fellow conlanger Wm Annis writes that the best database format for dictionaries is text.

All his points are valid, but at one point he says "The standard is SQL", and that got me thinking. I've done a fair bit of work with SQL, and can do scary things with it, but I wouldn't choose to use it. It's inflexible and clunky. You have to decide your schema in advance, and if your requirements change at a later date, you're stuck with altering or rebuilding entire tables. Anything more complex than a simple one-to-one relationship requires a second table and a join. SQL basically expects you to fit your data to the model, when what you need is to fit the model to your data. Using an ORM like SQLAlchemy doesn't help: it's just a layer of abstraction on top of an inherently clunky system.

For a good dictionary system, you need the flexibility of a NoSQL database. One popular system, which I've done a lot of work with, is MongoDB. It stores documents in a JSON-like format, so a dictionary entry might look like this:

{"word": "kitab",
 "definitions": [{"pos": "noun",
                  "definition": "book"}],
 "inflections": {"plural": {"nominative": "kutuub"}},
 "related": ["muktib", "kataaba"]}

If a field exists for some words but not others, you only need to put it in the relevant entries. If a field is variable length, you can store it in an array. One slight disadvantage is that cross-referencing between entries can be a little tricky.
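To make the optional-field and cross-referencing points concrete, here is a minimal sketch in plain Python. It uses ordinary dicts as a stand-in for a document collection and makes no MongoDB calls; the stub entry for "kataaba" and the helper names `related_entries` and `plural` are my own inventions for illustration.

```python
# Plain dicts standing in for stored documents (an assumption: no MongoDB here).
entries = [
    {
        "word": "kitab",
        "definitions": [{"pos": "noun", "definition": "book"}],
        "inflections": {"plural": {"nominative": "kutuub"}},
        "related": ["muktib", "kataaba"],
    },
    # Hypothetical stub entry: it simply omits the fields it doesn't need.
    {"word": "kataaba"},
]

# Index by headword, standing in for a lookup on the "word" field.
by_word = {entry["word"]: entry for entry in entries}


def related_entries(word):
    """Resolve the "related" headwords of `word` by hand.

    Returns (entries found, headwords with no entry of their own).
    """
    found, missing = [], []
    for ref in by_word[word].get("related", []):
        if ref in by_word:
            found.append(by_word[ref])
        else:
            missing.append(ref)
    return found, missing


def plural(word):
    """Return the nominative plural if the entry records one, else None."""
    entry = by_word[word]
    return entry.get("inflections", {}).get("plural", {}).get("nominative")


found, missing = related_entries("kitab")
print([entry["word"] for entry in found], missing)
print(plural("kitab"), plural("kataaba"))
```

The tricky part shows up in `related_entries`: the "related" field holds only headword strings, so every cross-reference needs a second lookup, and nothing stops it pointing at an entry that doesn't exist.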

Another possibility is ZODB. This is an object persistence system for Python objects. In many ways it's similar to MongoDB, but there's one important difference. If a member of a stored object is itself an object that inherits from persistent.Persistent, what is stored in the parent object is a reference to that object. Cross-referencing is therefore completely transparent. The only small disadvantage is that it's Python-specific, but unless you really need to write your dictionary software in a different language, that shouldn't be a big problem.

You might also want to consider a graph database like Neo4j. This stores data as a network of nodes and edges, like this:

kitab-[:MEANS]->book
kitab-[:PLURAL]->kutuub-[:MEANS]->books

In theory, this is the most flexible form of database. I wouldn't say it was easy to learn or use, though.
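To give a feel for the node-and-edge model without installing anything, here is a toy version in plain Python. It is not Neo4j and doesn't use its query language; the edge list mirrors the example above, and the `follow` helper is a name I made up.

```python
from collections import defaultdict

# Each edge is (source node, relationship label, target node).
edges = [
    ("kitab", "MEANS", "book"),
    ("kitab", "PLURAL", "kutuub"),
    ("kutuub", "MEANS", "books"),
]

# Adjacency list: node -> list of (label, target) pairs.
graph = defaultdict(list)
for source, label, target in edges:
    graph[source].append((label, target))


def follow(node, *labels):
    """Follow a chain of relationship labels from `node`.

    Returns every node reached at the end of the chain.
    """
    frontier = [node]
    for label in labels:
        frontier = [
            target
            for current in frontier
            for edge_label, target in graph.get(current, [])
            if edge_label == label
        ]
    return frontier


print(follow("kitab", "MEANS"))
print(follow("kitab", "PLURAL", "MEANS"))
```

The flexibility comes from the fact that a new kind of relationship is just another label on an edge, with no change of schema.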

There are plenty of other NoSQL databases. These are just the ones I'd use, but I think they're all more suitable for dictionary software than SQL. Whichever you choose, do make sure you have a human-readable backup.
