Choosing a Historical API

posted on 20 June 2017 by Jim Moffitt


Introduction  

Both Historical PowerTrack and Full-Archive Search provide access to any publicly available Tweet, starting with the first Tweet from March 2006. Deciding which product better suits your use case can be key in getting the data you want when you need it. This article highlights the differences between the two historical products, with a goal of helping you determine which one to use for generating your next historical Tweet dataset.

Both historical products scan a Tweet archive, examine each Tweet posted during the time of interest, and generate a set of Tweets matching your query. However, Historical PowerTrack and Full-Archive Search are based on significantly different archive architectures, resulting in fundamental product differences. These differences include supported PowerTrack Operators, the number of rules/filters per request, how estimates are provided, and how the data is delivered.

We’ll start off by providing an overview of Historical PowerTrack and the Full-Archive Search API, then discuss these differences in detail. We’ll wrap up the discussion with general guidance for selecting a historical product.

Historical PowerTrack  

Historical PowerTrack (HPT) is built to deliver Tweets at scale using a batch, Job-based design where the API is used to move a Job through multiple phases. These phases include volume estimation, Job acceptance/rejection, getting Job status, and downloading potentially many thousands of data files. Depending on the length of the request time period, Jobs can take hours or days to generate. A data file is generated for each 10-minute period that contains at least one Tweet. Therefore, a 30-day dataset will commonly consist of approximately 4,300 files, regardless of the number of matched Tweets.
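As a rough check of that file count, here is a minimal sketch that computes the upper bound on the number of data files for a given request period. The function name `max_hpt_files` is hypothetical and not part of any Twitter API; the only input taken from the text is the 10-minute file interval.

```python
from datetime import timedelta

FILE_INTERVAL = timedelta(minutes=10)  # one file per 10-minute period, per the text


def max_hpt_files(period: timedelta) -> int:
    """Upper bound on data files for a Job covering `period`.

    Periods with no matching Tweets produce no file, so the real
    count can be lower.
    """
    # Ceiling division so a partial trailing interval still counts as a file.
    return -(-period // FILE_INTERVAL)


print(max_hpt_files(timedelta(days=30)))  # 4320
```

A 30-day request therefore tops out at 4,320 files, which is where the "approximately 4,300" figure comes from.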

Historical PowerTrack, launched in 2012 by Gnip, was built on top of a file-based archive, with each file containing a short duration of the real-time Tweet firehose. As Historical PowerTrack generates your dataset, it performs file operations as it opens each file, and examines each Tweet in the file. During this process, Historical PowerTrack accesses all sections of the JSON payload that have an associated PowerTrack Operator. Historical PowerTrack supports the full set of PowerTrack Operators supported by real-time PowerTrack.

Full-Archive Search  

Full-Archive Search (FAS) delivers matched Tweets in a manner analogous to Google Search. When you search for a particular term, Google Search does not return or display the potentially millions of matching results all at once. Rather, it delivers a small number of results per scrollable page and allows you to click “Next” for the subsequent page of results… and so on, until all of the matching items are returned. Full-Archive Search works in a similar way, responding with a subset of Tweets, with the ability to paginate for more Tweets as needed.

Full-Archive Search is designed using the classic RESTful request/response pattern, where a single PowerTrack rule is submitted and a response with matching Tweets is immediately provided. Full-Archive Search can provide a maximum of 500 Tweets per response, and a ‘next’ token is provided to paginate until all Tweets for a query are received. The more Tweets that match your query, the longer it will take to retrieve all of that data through the Search API. This is because you need to stitch together the paginated result sets, one after the other, to create the complete set of all returned data.
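The pagination pattern described above might be sketched as follows. This is an illustration under stated assumptions, not the official client: `fetch_page` is a hypothetical callable standing in for one search request, `fake_backend` is a made-up stand-in so the sketch runs offline, and the `results` and `next` field names are assumed from the wording of this article.

```python
from typing import Callable, Iterator, Optional

MAX_RESULTS = 500  # maximum Tweets per response, per the text


def iter_tweets(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every Tweet for one query by following 'next' tokens.

    `fetch_page(next_token)` is a hypothetical callable that performs one
    search request and returns the parsed JSON response as a dict.
    """
    next_token = None
    while True:
        page = fetch_page(next_token)
        yield from page.get("results", [])
        next_token = page.get("next")
        if not next_token:  # no token means the last page was reached
            break


def fake_backend(total: int) -> Callable[[Optional[str]], dict]:
    """Made-up stand-in for the real API so the sketch runs offline."""
    tweets = [{"id": i} for i in range(total)]

    def fetch_page(next_token: Optional[str]) -> dict:
        start = int(next_token) if next_token else 0
        end = start + MAX_RESULTS
        page = {"results": tweets[start:end]}
        if end < total:
            page["next"] = str(end)  # token format is invented for this demo
        return page

    return fetch_page


collected = list(iter_tweets(fake_backend(1200)))
print(len(collected))  # 1200, gathered over three requests
```

Passing the request function in as an argument keeps the stitching logic separate from authentication, rate-limit handling, and the HTTP client, all of which depend on your own setup.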

Many use cases focus on the number of Tweets associated with a query, rather than the actual Tweet messages themselves. Full-Archive Search supports a ‘counts’ endpoint that returns a time-series of the number of matched Tweets. These Tweet counts are returned in a time-series of minute-by-minute, hourly, or daily totals.

Note that Twitter also provides a 30-Day Search API. If you only need data from the last 30 days, this API is likely the best match for your use case.

Product Differences  

Here are the fundamental differences between Historical PowerTrack and Full-Archive Search:

  • Number of rules supported per request

    • A Historical PowerTrack Job can support up to 1,000 rules.
    • Full-Archive Search accepts a single rule per request.

    • Note: With each product a single rule can contain up to 2,048 characters.

  • How data is delivered

    • Historical PowerTrack generates a time-series of data files, each covering a ten-minute period. For example, each hour of data is provided in six 10-minute data files, assuming each 10-minute period has at least one Tweet (if a period has no Tweets, no file is generated). So, once a Historical PowerTrack ‘Job’ is finished, the next step is downloading the data files. Inside each Historical PowerTrack file, the JSON Tweet payloads are written as individual, self-contained objects and are not presented in a JSON array. File contents need to be parsed using newline characters as a delimiter.
    • With Full-Archive Search, Tweets in each response are arranged in a “results” array. A maximum of 500 Tweets are available per response and a ‘next’ token is provided if more Tweets are available. For example, if a 60-day request for a single PowerTrack rule matches 10,000 Tweets, at least 20 requests must be made of the Search API.

  • Supported PowerTrack Operators

    • While the majority of Operators supported by HPT are also supported by FAS, there is a set of Operators not available in FAS:

        bio:                       profile_bounding_box:
        bio_location:              profile_point_radius:
        bio_name:                  profile_subregion:
        contains:                  retweets_of_status_id:
        is:quote                   sample:
        followers_count:           source:
        friends_count:             statuses_count:
        has:lang                   url_contains:
        in_reply_to_status_id:     url_description:
        listed_count:              url_title:
        time_zone:

  • Data Volume Estimates

    • Full-Archive Search provides a ‘counts’ endpoint that is used to generate a minutely, hourly, or daily time-series of matching Tweets. For use cases that benefit from knowing about data volumes, in addition to the actual data, the Full-Archive Search ‘counts’ endpoint is the tool of choice. Note that the ‘counts’ endpoint is a measure of pre-compliant matched Tweets. Pre-compliant means the Tweet totals do not take into account deleted and protected Tweets, whereas data requests will not include deleted or protected Tweets.

    • The Historical PowerTrack API provides an order-of-magnitude estimate for the number of Tweets a Job will match. These estimates are based on a sampling of the time period to be covered, and should be treated as a directionally accurate guide to the amount of data a historical Job will return. A Historical PowerTrack estimate will help answer whether a Job will match 100,000 or 1,000,000 Tweets. The goal is to provide reasonable expectations around the amount of data a request will return, and the Historical PowerTrack API should not be used as an estimation tool.

  • Product licensing and pricing

    • Both the Full-Archive Search and Historical PowerTrack APIs are available with 12-month subscriptions. In addition, Historical PowerTrack is available on a ‘one-off’ basis. So if you need historical Tweet data for a one-time project or ad hoc research, Historical PowerTrack will be better suited to your needs.
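To illustrate the Historical PowerTrack file layout described under ‘How data is delivered’ above, here is a minimal sketch that reads newline-delimited JSON payloads one at a time. The helper name `read_tweets` and the sample payload fields are made up for illustration; real files carry full Tweet payloads, and any decompression or download step is left out.

```python
import io
import json
from typing import Iterator, TextIO


def read_tweets(handle: TextIO) -> Iterator[dict]:
    """Yield one Tweet dict per non-empty line of a newline-delimited file."""
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines between payloads
            yield json.loads(line)


# Two made-up payloads standing in for the contents of one data file.
sample_file = io.StringIO(
    '{"id": 1, "body": "first"}\n\n{"id": 2, "body": "second"}\n'
)
tweets = list(read_tweets(sample_file))
print([t["id"] for t in tweets])  # [1, 2]
```

Because each line is parsed on its own, the file never has to be loaded as a single JSON document, which matters when a Job produces thousands of files.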

Selecting a Historical Product  

Full-Archive Search is best for collecting lower-volume datasets, while Historical PowerTrack is more appropriate for higher-volume datasets. Our general recommendation is that Full-Archive Search is best suited for datasets of a few million Tweets or fewer. This recommendation is intentionally vague, as there is no real technical reason why Full-Archive Search could not be used for very large data requests. However, depending on how you plan to retrieve the data and utilize API rate limits, there is a threshold beyond which Historical PowerTrack processes the data faster than Full-Archive Search. Say a Historical PowerTrack Job delivers 8 million Tweets in 6 hours. Pulling 8 million Tweets via Full-Archive Search, at a maximum of 500 Tweets per response, would require a minimum of 16,000 search requests. If you also assume 2-second response times and an app that paginates with a single thread, those requests would take roughly 8.9 hours (16,000 × 2 seconds = 32,000 seconds), so Full-Archive Search would take longer to retrieve this 8-million-Tweet dataset.
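The comparison above can be reproduced with a few lines of arithmetic. This sketch uses only the figures from the example (500 Tweets per response, 2-second responses, a single thread) and ignores rate-limit waits; the function name `fas_retrieval_hours` is hypothetical.

```python
import math

MAX_TWEETS_PER_RESPONSE = 500  # Full-Archive Search page size, per the text


def fas_retrieval_hours(total_tweets: int, seconds_per_request: float) -> float:
    """Hours to page through `total_tweets` with one thread, no rate-limit waits."""
    requests_needed = math.ceil(total_tweets / MAX_TWEETS_PER_RESPONSE)
    return requests_needed * seconds_per_request / 3600


hpt_hours = 6  # the example Job duration from the text
fas_hours = fas_retrieval_hours(8_000_000, 2.0)
print(f"FAS: {fas_hours:.1f} h vs HPT: {hpt_hours} h")  # FAS: 8.9 h vs HPT: 6 h
```

Plugging in your own expected volume and observed response time gives a quick first guess at where the break-even point falls for your setup.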

Beyond data volume and completion times, there are other reasons why one historical product may be better suited to your use case. Historical PowerTrack is the right product if you require any of the Operators that are not currently supported in Full-Archive Search (see above). Historical PowerTrack is great for less time-sensitive research. Historical PowerTrack is also better suited for large query rulesets, as it supports up to one thousand (1,000) rules per Job, whereas the Search API products only support a single PowerTrack rule per request. Finally, Historical PowerTrack is a great choice for use cases where new Tweet attributes of interest (e.g. hashtags, mentions, and URLs) received in real time trigger a historical request. Since Historical PowerTrack supports the same full set of Operators as real-time PowerTrack, you can always ‘plug in’ the corresponding real-time rules into a Historical PowerTrack Job.

On the other hand, Full-Archive Search is a better choice if you are building tools that depend on near-instant results and data volume estimates. Quick responses are key if you are building historical Tweet results into a dashboard or user application. Full-Archive Search has been built into many systems that enable users to create filters and instantly inspect Tweets that match. Depending on initial responses, users can continue to paginate through the results, or instead broaden or narrow the filter and retry. Since Full-Archive Search provides more accurate volume estimates with its ‘counts’ endpoint, another common Search use case is experimenting with filters before adding them to a real-time stream.

Next Steps