Gnip’s Full-Archive Search API offers access to the full corpus of Twitter data dating all the way back to the first Tweet in March 2006, much like Gnip’s Historical PowerTrack product.

The Full-Archive Search API is a RESTful API that supports one query per request. Queries are written with the standard Gnip PowerTrack rule syntax. Users can specify any time period, to the granularity of a minute, going back to March 2006. However, responses will be limited to the lesser of your specified maxResults OR 31 days and include a ‘next’ token to paginate for the next set of results. If time parameters are not specified, the API will return matching data from the 30 most recent days.

Requests include a ‘maxResults’ parameter that specifies the maximum number of Tweets to return per API response. If more Tweets are associated with the query than this maximum amount of results per response, a ‘next’ token is included in the response. These ‘next’ tokens are used in subsequent requests to page through the entire set of Tweets associated with the query.
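The paging loop described above can be sketched as follows. This is a minimal illustration rather than an official client: `iter_tweets` and `fetch_page` are hypothetical names, `fetch_page` stands in for whatever HTTP call your application makes, and the `query` parameter and `results` response field are assumptions. Only `maxResults` and `next` come from the description above.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_tweets(fetch_page: Callable[[Dict], Dict], query: str,
                max_results: int = 500) -> Iterator[dict]:
    """Yield every Tweet for `query`, following 'next' tokens until none is returned.

    `fetch_page` is a caller-supplied function (hypothetical) that sends one
    request with the given parameters and returns the decoded response body.
    """
    next_token: Optional[str] = None
    while True:
        params = {"query": query, "maxResults": max_results}
        if next_token is not None:
            # Pass the token from the previous response to get the next page.
            params["next"] = next_token
        page = fetch_page(params)
        # "results" is an assumed field name for the list of Tweets.
        yield from page.get("results", [])
        next_token = page.get("next")
        if not next_token:
            break  # no 'next' token means the result set is exhausted
```

Because the function is a generator, a display use case can stop consuming after the first page, while a bulk pull can simply iterate to the end.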

Much like Gnip’s current 30-day Search API product, users can also request counts (data volume) for the time period they are interested in. While Full-Archive Search API supports the PowerTrack filtering syntax, please note that not all PowerTrack Operators are currently supported (see Available Operators).

Search Requests

Search requests to the Full-Archive Search API allow you to retrieve up to 500 results per response for a given timeframe, with the ability to paginate for additional data. Using the maxResults parameter, you can specify smaller page sizes for display use cases (allowing your user to request more results as needed) or larger page sizes (up to 500) for larger data pulls. The data is delivered in reverse chronological order and is compliant at the time of delivery.

Counts Requests

Counts Requests retrieve historical activity counts, which reflect the number of activities matching a given query that occurred during the requested timeframe. The response is essentially a histogram of counts, bucketed by day, hour, or minute (the default bucket is hour). These counts do not reflect any later compliance events (deletions, scrub geos), so some counted activities may not be available to retrieve with a Search request due to user compliance actions.
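As a rough sketch of how a counts request might be assembled, the helper below validates the bucket size and renders timestamps at minute granularity. The parameter names (`query`, `bucket`, `fromDate`, `toDate`) and the YYYYMMDDHHMM timestamp format are assumptions made for illustration; the API Reference is the authority on the actual request format.

```python
from datetime import datetime
from typing import Dict, Optional

# Bucket sizes named in the text above; "hour" is the stated default.
VALID_BUCKETS = ("day", "hour", "minute")


def format_minute(ts: datetime) -> str:
    """Render a timestamp at minute granularity (assumed format: YYYYMMDDHHMM)."""
    return ts.strftime("%Y%m%d%H%M")


def build_counts_params(query: str,
                        from_date: Optional[datetime] = None,
                        to_date: Optional[datetime] = None,
                        bucket: str = "hour") -> Dict[str, str]:
    """Build a parameter dict for a counts request (parameter names are assumed)."""
    if bucket not in VALID_BUCKETS:
        raise ValueError(f"bucket must be one of {VALID_BUCKETS}, got {bucket!r}")
    params = {"query": query, "bucket": bucket}
    # Time parameters are optional; omit them to accept the API's default window.
    if from_date is not None:
        params["fromDate"] = format_minute(from_date)
    if to_date is not None:
        params["toDate"] = format_minute(to_date)
    return params
```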

Data Availability / Important Dates

  • First Tweet: 3/21/2006
  • First Native Retweets: 11/6/2009
  • First Geo-tagged Tweets: 11/19/2009
  • URLs First Indexed for filtering: 8/27/2011
  • Enhanced URL Expansion metadata (website titles and descriptions): 12/1/2014

Filtering Syntax

Full-Archive Search API supports rules of up to 2,048 characters, with no limits on the number of positive and negative clauses. However, only a subset of PowerTrack Operators is initially supported.


Available Operators 

The Full-Archive Search API currently supports the following operators:

  • keyword
  • “quoted phrase”
  • “keyword1 keyword2”~N
  • from:
  • to:
  • retweets_of:
  • lang: (Note: this Operator references language classifications provided by Twitter - see Language Classifications for more info)
  • #
  • @
  • $
  • url_contains:
  • bounding_box:[west_long south_lat east_long north_lat]
  • point_radius:[lon lat radius]
  • is:verified
  • is:retweet
  • has:geo
  • place:
  • place_country:
  • has:profile_geo
  • profile_country:
  • profile_region:
  • profile_locality:
  • has:mentions
  • has:hashtags
  • has:media
  • has:videos
  • has:images
  • has:links
  • has:symbols

Note: The ‘lang:’ operator and all ‘is:’ and ‘has:’ operators cannot be used as standalone operators and must be combined with another clause (e.g. @gnip has:links).
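A client-side pre-check for the two constraints stated in this section (the 2,048-character limit and the ban on standalone lang:, is:, and has: clauses) might look like the sketch below. `check_rule` is a hypothetical helper, and its whitespace-based clause splitting is deliberately naive: it does not parse quoted phrases or boolean operators the way the real rule parser does, so treat it as a first-pass filter only.

```python
MAX_RULE_LENGTH = 2048
# Operators that must be combined with at least one other clause.
NON_STANDALONE_PREFIXES = ("lang:", "is:", "has:")


def check_rule(rule: str) -> list:
    """Return a list of human-readable problems found in `rule` (empty if none)."""
    problems = []
    if len(rule) > MAX_RULE_LENGTH:
        problems.append(f"rule is {len(rule)} characters; the limit is {MAX_RULE_LENGTH}")
    # Naive split on whitespace; a leading "-" (negation) or "(" is ignored.
    clauses = rule.split()
    if not clauses:
        problems.append("rule is empty")
    elif all(c.lstrip("-(").startswith(NON_STANDALONE_PREFIXES) for c in clauses):
        problems.append("lang:, is: and has: clauses need at least one other clause")
    return problems
```

For example, `check_rule("@gnip has:links")` returns an empty list, while `check_rule("has:links")` reports the standalone-operator problem.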


Operator Differences / Matching Changes 

For the initial release of the Gnip Full-Archive Search API, some operators will behave or match differently than they do in Gnip’s 30-Day Search and PowerTrack.

  • All text matching: Accented and special characters are normalized to standard Latin characters, which can change meanings in foreign languages or return unexpected results:
    • e.g. "músic" will match "music" and vice versa
    • e.g. common phrases like "Feliz Año Nuevo!" in Spanish would be indexed as "Feliz Ano Nuevo", which changes the meaning of the phrase
  • Quoted phrase: Punctuation is not tokenized and is instead treated as whitespace.
    • e.g. "Search API for Twitter" will match "http://blog.gnip.com/search-api-for-twitter"
    • e.g. "Love Snow" will match "#love #snow"
    • e.g. "#Love #Snow" will match "love snow"
    • e.g. quoted "#hashtag" will match "hashtag" but not #hashtag (use the hashtag # operator without quotes to match on actual hashtags)
    • e.g. quoted "$cashtag" will match "cashtag" but not $cashtag (use the cashtag $ operator without quotes to match on actual cashtags)
    • e.g. "Hello World" matches "Hello. World", "Hello, World", etc.
  • from:, to:, @, retweets_of: Both account usernames and numeric IDs can be used. Since usernames can change, we recommend use of from:numeric_user_id whenever possible.
  • lang: Uses Twitter’s per-Tweet language classification rather than Gnip’s (which is being deprecated). See the Language Classifications section for a list of supported languages.
  • url_contains: Punctuation is not tokenized and is instead treated as whitespace.
    • e.g. "www.twitter.com" is tokenized into "www", "twitter", "com", and thus url_contains:"www.twitter.com" will match Tweets with http://www.twitter.com/… and "www.youtube.com/www/twitter/com", or any URL with those tokens.
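The normalization and punctuation handling described above can be approximated in a few lines. This is only an illustration of the described behavior, not Gnip's actual indexing code: `normalize` and `tokenize` are hypothetical names, accent folding is done with Unicode decomposition, and the tokenizer keeps only ASCII letters and digits, so non-Latin scripts and characters without a decomposition (such as "ø") are not handled.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold accented characters to their base letters (e.g. "ñ" becomes "n")."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left behind by decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list:
    """Fold accents, lower-case, and treat every other character as whitespace."""
    return re.findall(r"[a-z0-9]+", normalize(text).lower())


print(normalize("músic"))            # music
print(tokenize("Feliz Año Nuevo!"))  # ['feliz', 'ano', 'nuevo']
print(tokenize("#Love #Snow"))       # ['love', 'snow']
```

Under this model, "Hello. World" and "Hello, World" produce the same tokens as "Hello World", which is why a quoted phrase matches all three.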


Notes on url_contains Operator

In the Full-Archive Search API, the url_contains operator behaves in a slightly different manner than the equivalent operator in realtime PowerTrack streams. Specifically, url_contains in Search does not act as a pure substring match. In Search, url_contains: must be used with complete words or groups of complete words, but NOT partial words or substrings.

For example:

Target URL: http://blog.gnip.com/search-api-for-twitter/

  • Matching rules:
    • url_contains:"blog.gnip.com/search-api-for-twitter/"
    • url_contains:"gnip.com"
    • url_contains:"search-api"
    • url_contains:"blog.gnip.com"
    • url_contains:"gnip.com/search"
    • url_contains:api
    • url_contains:twitter
  • Non-matching rules:
    • url_contains:"blog.gnip.com/sea"
    • url_contains:"log.gnip.com"
    • url_contains:"ip.com"
    • url_contains:"ip"
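One way to reason about the examples above is to model url_contains as a match on whole tokens rather than on substrings. The sketch below is an approximation under two assumptions that the text does not state explicitly: that matching is case-insensitive, and that the tokens of the rule must appear contiguously and in order in the URL. `url_tokens` and `url_contains_matches` are hypothetical names.

```python
import re


def url_tokens(value: str) -> list:
    """Split a URL or rule value into tokens, treating punctuation as whitespace."""
    return re.findall(r"[a-z0-9]+", value.lower())


def url_contains_matches(phrase: str, url: str) -> bool:
    """Return True if the tokens of `phrase` appear as a contiguous run in `url`."""
    needle, haystack = url_tokens(phrase), url_tokens(url)
    n = len(needle)
    if n == 0:
        return False
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))


target = "http://blog.gnip.com/search-api-for-twitter/"
print(url_contains_matches("gnip.com/search", target))  # True: whole tokens
print(url_contains_matches("log.gnip.com", target))     # False: "log" is a partial word
```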

Additional Notes

Below are some additional notes on possible differences in matching behavior between Full-Archive Search and 30-Day Search:

  • The last characters of a RT may get truncated, causing some Tweets to not match.

Data Updates and Mutability 

Unlike Gnip’s 30-day Search API and other historical products (Historical PowerTrack, Replay), some of the data within a Tweet is mutable, i.e. can be updated or changed after initial archival.

This mutable data falls into two categories:

  • Metadata around a user/actor object:
    • User’s @handle (numeric ID does not ever change)
    • Bio description
    • Counts: statuses, followers, friends, favorites, lists
    • Profile location
    • Other details such as time zone and language
  • Tweet statistics - i.e. anything that can be changed on the platform by user actions (examples below):
    • Favorites count
    • Retweet count

In most of these cases, the Search API will return data as it exists on the platform at query-time, rather than Tweet generation time. However, in the case of queries using select operators (e.g. from, to, @, is:verified), this may not be the case. Data is updated in our index on a regular basis, with an increased frequency for most recent timeframes. As a result, in some cases the data returned may not exactly match the current data as displayed on Twitter.com, but matches data at the time it was last indexed.

Note, this issue of inconsistency only applies to queries where the operator applies to mutable data (e.g. user and user-bio related operators and counts-based operators). We will not be supporting many such operators initially, but will offer more in the future. The most problematic example is indeed filtering for usernames, and the best workaround would be to use userIDs rather than @handles for these queries.


Enrichment Availability 

  • Language Classification: In the Full-Archive Search API, Gnip’s language classification has been deprecated in favor of Twitter’s language classification, which is present in the payload and available for filtering. In support of this change, the twitter_lang: operator has been deprecated, with the lang: operator now applying to Twitter’s classification. See the Language Classifications section below for more information on supported languages.

  • Klout: Gnip’s Klout enrichment is now available in the payload via Full-Archive Search API. We still do not offer the ability to filter on Klout, but plan to in the future.


Single vs. Multi-threaded Requests 

Each customer has a defined rate limit for their search endpoint. The default rate limit for Full-Archive Search is 120 requests per minute, for an average of 2 queries per second (QPS). This average QPS means that, in theory, 2 requests can be made of the API every second. Given the pagination feature of the product, if a one-year query has one million Tweets associated with it, spread evenly over the year, at least 2,000 requests would be required (assuming a ‘maxResults’ of 500) to receive all the data. Assuming it takes two seconds per response, that is 4,000 seconds (or just over an hour) to pull all of that data serially through a single thread (one request at a time, each using the prior response’s “next” token). Not bad!

Now consider the situation where twelve parallel threads are used to receive data. Assuming an even distribution of the one million Tweets over the one-year period, you could split the requests into twelve parallel threads (multi-threaded) and utilize more of the per-second rate limit for the single “job”. In other words, you could run one thread per month you are interested in, and by doing so retrieve the data 12x as fast (in about 6 minutes).
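The arithmetic in the two paragraphs above can be reproduced with a small helper. `estimate` is a hypothetical function written for this illustration, and it assumes every page is full and every response takes the same time.

```python
import math


def estimate(total_tweets: int, max_results: int = 500,
             seconds_per_response: float = 2.0, threads: int = 1) -> dict:
    """Estimate request count and wall-clock time for paging through a result set."""
    requests = math.ceil(total_tweets / max_results)
    # Each thread pages through its own share of the requests sequentially.
    requests_per_thread = math.ceil(requests / threads)
    return {
        "requests": requests,
        "wall_seconds": requests_per_thread * seconds_per_response,
        # Combined request rate across all threads, to compare with the rate limit.
        "aggregate_qps": threads / seconds_per_response,
    }


print(estimate(1_000_000))              # 2000 requests, 4000.0 seconds, 0.5 QPS
print(estimate(1_000_000, threads=12))  # 2000 requests, 334.0 seconds, 6.0 QPS
```

Note that, under these assumptions, twelve threads issue about 6 requests per second in aggregate. That is above the default of 120 requests per minute, so the full 12x speed-up presumes an account rate limit that allows it; with the default limit, roughly four such threads would already reach 2 QPS.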

This multi-threaded example applies equally well to the counts endpoint. For example, if you wanted to receive Tweet counts for a two-year period, you could make a single-threaded request and page back through the counts 31 days at a time. Assuming it takes 2 seconds per response, it would take approximately 48 seconds to make the 24 API requests and retrieve the entire set of counts. However, you also have the option to make multiple one-month requests at a time. When making 12 requests per second, the entire set of counts could be retrieved in approximately 2 seconds.
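Splitting a long period into windows that can be requested in parallel might be sketched as follows. `split_range` and `fetch_all` are hypothetical helpers, and `fetch_window` stands in for your own function that requests one window (and handles its pagination). The thread count should be chosen to stay within your account's rate limit.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta


def split_range(start: datetime, end: datetime, max_days: int = 31) -> list:
    """Split [start, end) into consecutive windows of at most `max_days` days."""
    if end <= start:
        raise ValueError("end must be after start")
    step = timedelta(days=max_days)
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


def fetch_all(fetch_window, windows, threads: int = 12) -> list:
    """Run `fetch_window(start, end)` for every window on a pool of threads.

    Results are returned in the same order as `windows`.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda w: fetch_window(*w), windows))


windows = split_range(datetime(2014, 1, 1), datetime(2016, 1, 1))
print(len(windows))  # 24 windows of up to 31 days for the two-year period
```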


Language Classifications 

The list below represents the supported language classifications and their corresponding BCP 47 language identifiers:

  • Amharic - am
  • Arabic - ar
  • Armenian - hy
  • Bengali - bn
  • Bosnian - bs
  • Bulgarian - bg
  • Cherokee - chr
  • Chinese - zh
  • Croatian - hr
  • Danish - da
  • Dutch - nl
  • English - en
  • Estonian - et
  • Finnish - fi
  • French - fr
  • Georgian - ka
  • German - de
  • Greek - el
  • Gujarati - gu
  • Haitian - ht
  • Hebrew - iw
  • Hindi - hi
  • Hungarian - hu
  • Icelandic - is
  • Indonesian - in
  • Inuktitut - iu
  • Italian - it
  • Japanese - ja
  • Kannada - kn
  • Khmer - km
  • Korean - ko
  • Lao - lo
  • Latvian - lv
  • Lithuanian - lt
  • Malayalam - ml
  • Maldivian - dv
  • Marathi - mr
  • Myanmar-Burmese - my
  • Nepali - ne
  • Norwegian - no
  • Oriya - or
  • Panjabi - pa
  • Pashto - ps
  • Persian - fa
  • Polish - pl
  • Portuguese - pt
  • Romanian - ro
  • Russian - ru
  • Serbian - sr
  • Sindhi - sd
  • Sinhala - si
  • Slovak - sk
  • Slovenian - sl
  • Sorani Kurdish - ckb
  • Spanish - es
  • Swedish - sv
  • Tagalog - tl
  • Tamil - ta
  • Telugu - te
  • Thai - th
  • Tibetan - bo
  • Turkish - tr
  • Ukrainian - uk
  • Urdu - ur
  • Uyghur - ug
  • Vietnamese - vi
  • Welsh - cy




Continue to the API Reference