Storing Twitter Data in Relational Databases - Part 1
posted on 25 February 2015 by Jim Moffitt
There are many ways to store social media data, such as in files, NoSQL-type datastores, and relational databases. This is an introduction to a series of articles on storing Twitter data in relational databases.
There are several key questions to ponder as you design your database schema:
- What metadata is provided and what of it is needed for analysis and research?
- How big will your database get? Are you continually filling your data store from a 24/7 stream of data? Or are you working with a historical and static dataset?
- How long will the data be stored? Will the stored data be from a moving windows of time, say 90 days, or will the database continually be added to?
- Many attributes of social data are delivered as arrays with variable lengths. For Twitter data, these include hashtags, user mentions, and URLs. Given that database schemas are structurally static, how will these arrays be stored?
- Many attributes of social data do not change very often. For example, tweet data includes metadata about the author that rarely changes, such as their display name, account ID, profile location, and bio description. Other attributes change slowly, such as follower, friend, and favorite counts. Does your use-case involve tracking these changes, and what trade-offs are there for doing so?
In this series of articles we’ll explore these fundamental decisions, discuss options when designing your database schema, and provide some example schemas for getting started.
At the highest level:
- Database schemas consist of tables that are made up of one of more rows.
- Table rows are made up of one or more fields.
- Field names and types describe the table schema.
- Database schemas also involve details such as primary keys, foreign keys, and indexes.
- Primary keys are made up of individual or groups of fields.
- Primary keys provide a mechanism to enforce uniqueness of your table contents by preventing duplicates.
- Indexes are helpers for searching and selecting data.
- When you define an index you are asking your database engine to essentially pre-sort your data, making it faster to search.
- Indexes come with overhead since they consist of copies of your data and thus make your database size larger. Therefore they should be created based on the types of queries your use-case needs to support.
- Foreign Keys define how one table relates to another table. Foreign keys are made up of one or more fields in one table that uniquely identifies a row in another table.
The following discussion will focus mainly on suggested options for specifying tables and fields, and less on recommendations for defining indexes. Creation of indexes should be driven by your data retrieval patterns.
The examples below are based on storing Twitter data in a database. If you are working with data from another social network, these examples will still illustrate the type of design considerations and potential techniques for storing your data. If you are storing data from multiple sources it is likely that there are some fundamental metadata common to all such as posted time, activity IDs and author IDs. Other important details will be source-specific, such as the type of activity (post or ‘like’) and the length of the activity “body” (short tweet or long blog post). While it is certainly possible to store a mix of sources in a single table, there are many advantages to storing sources in their own unique tables.
What activity metadata do you need to store?
When storing activity data (in this case tweets) in a database, you are essentially passing the data through a transform where you cherry-pick the data you care about. Inserting social media data into a database provides an opportunity to filter the incoming data, explicitly storing the data you want to keep, and ignoring the data you do not want.
Every tweet arrives with a large set of supporting metadata. This set can contain well over 150 attributes that provide information about the user that posted the tweet, any available geographic information, and other information such as lists of hashtags and user mentions included in the tweet message.
Given your particular use-case you may only need a subset of this supporting metadata and decide not to store every piece of data provided. For example, the standard Twitter metadata includes the numeric character position of hashtags in a tweet message. You may decide that you do not need this information, and therefore can omit those details from your database schema.
To filter out such data means simply that you do not have a field in your database to store it, and when parsing the tweet payload you simply ignore the attribute.
An Example Use-Case
Every data storage effort is driven by user and research ‘stories.’ In the end database schemas are driven by the type of questions you want to explore with social data. Given my background and interest in flood-warning systems I wanted to explore the role Twitter data played in the 2013 Boulder Flood – an historic rain and flood event that occurred September 12-15 and had lasting affects across the Boulder region. The questions I wanted to explore were:
- How did the Twitter signal track with local rain gauge data?
- Previous analysis of most less intense events revealed a strong signal. How would the comparison of rain gauge data and Twitter data differ for a 1000-year flood?
- How did the Twitter signal track with local stage gauge data?
- How did that signal attenuate as the flood waters moved downstream?
- How many unique users posted during the event?
- What precentage of tweets were geo-tagged during the event?
- How many real-time photos and video were coming out on Twitter during the event?
- How did the followers of local public safety agencies change during and after the 2013 flood?
These questions were posed in order to characterize the role Twitter played during the flood and its aftermath. These types of questions guided the design of the database schema at the heart of our look at the 2013 Boulder flood.
For a look at the resulting blog post, see HERE.
How can I protect myself from later realizing that there are key metadata that I haven’t been storing?
When you design your database schema you are explicitly saving selected data, and anything not selected is ignored and not available for future use. Given this, it pays to carefully identify what data you need the first time.
When working with Twitter data for some flood-related blog posts we were not originally storing follower counts in its own field. Later we wanted to study how follower counts for public agencies increased during and after flood events. Luckily, we had two options to provide that opportunity.
First, we were working with Historical PowerTrack data and we had saved the time-series data files that that product generates. It is considered a best-practice to store the complete ‘raw’ data. Complete JSON payloads can be inserted into a NoSQL system, such as MongoDB, or as flat-files. Doing this provides a redundant datastore in case you have database problems. Also, assuming you are not storing every metadata attribute, this provides an ‘insurance policy’ against discovering that there is metadata needed for analysis that have not been part of your database schema.
Second, when we designed the database schema for the flood project we decided to store each JSON payload as a separate field in my “activity” table. For this project we knew going in that we were going to have less than 500,000 activities so the overhead from this extra storage seemed worth it. For many common use cases, this type of redundant storage will not scale very well.
It should be noted that regardless of the method here, it can be a bit painful to parse and load the “missing” JSON data a second time. It is highly recommended to learn from my mistake and get your schema specified correctly the first time!