Storing Dynamic and Static Metadata in Relational Databases - Part 3
posted on 26 March 2015 by Jim Moffitt
Storing Dynamic and Static Metadata
While much of Twitter metadata is dynamic in nature, changing tweet-by-tweet, other metdata can change more slowly or stay completely static for long periods of time. Twitter user accounts (stored in the ‘Actor’ object in the Activity Stream format) provides good examples of both ‘static’ and ‘slow’ metadata. An actor’s numeric ID and account creation time will never change. Their account-level language, timezone, location, display and handle names may rarely change. Meanwhile their status, followers and friends counts will surely change over time.
Here are some example attributes that generally fall into these three categories:
- Mostly Dynamic/Active:
- Tweet body, published time and link.
- Hashtags, Mentions, URLs, media and other Twitter “entities”.
- Gnip matching rules and expanded URLs.
- Tweet geographic location (if geo-tagged).
- Change more slowly:
- User/Actor follower, friend, status counts.
- Klout score and topics.
- Mostly Static:
* User/Actor account ID, display and handle names, language, timezone.
* User/Actor Profile Location and Gnip Profile Geo enrichment.
* Twitter provider object.
* Tweet language (both Twitter and Gnip classifications).
How you manage and store these metadata depends on your specific use-case and its data analysis requirements. For example, perhaps you want to track the amount of followers an account has during a on-line campaign.
Here are a few schema options to consider:
Store all metadata at the activity level
One method is to store all metadata at the activity (tweet) level so all attributes such as actor metadata are stored along with each tweet the actor posts. While this is the most simple design, it has a fundamental disadvantage. Much of these data will be static, so significant storage space is spent on redundant data. However, the required SQL for retrieving data is simple, and client-side code remains simple. If you are working with a finite dataset, such as a “one-off” from Gnip’s Historical PowerTrack product, this simple design may be adequate. However, if your dataset is continually growing you may want to condider the others below.
See HERE for an example (ActiveRecord) schema for storing all tweet metadata in a single table.
Store select dynamic and static metadata in separate tables
Another option is to segregate the metadata into two groups: attributes you want to track over time, tweet by tweet, and others that you only need to store one value for, such as the most recent value. For example, you could define a
users_static table that contains fields such as ‘preferredUsername’, ‘link’, ‘postedTime’, ‘languages’, and ‘twitterTimeZone’. Then for fields that are more likely to change you can define a ‘users_active’ table that stores fields such as ‘followersCount’ and ‘favoritesCount’.
See HERE for two example tables for storing static and dynamic attributes separately.
Store dynamic metadata at activity level and static data in its own tables
With this design the more dynamic data is stored at the activity level table, with more constant data being written to a “static” table. This is sort of a hybrid design that offers some efficiencies for storing static data along with simplicity for storing dynamic data.
This is the schema design selected for the flood project mentioned above. See HERE for that schema design.