Frequently Asked Questions


At Twitter, we focus on providing social data that is sustainable, reliable, and complete, offering access to the full archive of publicly available Tweets through the Twitter Public APIs and Gnip’s enterprise platform. For most customers, transitioning from another data provider to Gnip will mean using Gnip’s PowerTrack products. These PowerTrack products make collecting social data easier and more valuable for our customers, so they can focus on the analysis.


Getting Started  

Integrating directly with the Gnip APIs. For realtime data, you will generally manage a single stream, with multiple topics/projects/customers served through that shared stream. Sets of rules can be assigned common ‘tags’, and these tags can be used to segregate data after collection. For historical data, you will use a RESTful API to access data from the entire Twitter archive.

Set up your rules and rule logic for PowerTrack. Creating your PowerTrack rules and rule management logic is an important step for data accuracy. Whether you are just becoming familiar with the rule syntax or translating your existing rules, testing for desired outcomes is recommended.

Ingesting data from the Gnip APIs. Gnip customers typically manage both their connection(s) to the Gnip APIs and the writing of that data to their datastore.

Realtime PowerTrack provides customers with the ability to filter Twitter’s full realtime firehose and data is delivered to the customer’s application through a constant stream as Tweets are posted. See the realtime PowerTrack Questions section below for more details. Historical PowerTrack is a RESTful API that provides access to the entire historical archive of public Twitter data – back to the first Tweet in March 2006 – using the same rule-based-filtering system as realtime PowerTrack. See the Historical PowerTrack Questions section below for more details.

By connecting directly to the Gnip data services, you can take advantage of many enterprise-ready features that provide reliable connectivity and full fidelity data. As an enterprise licensed-access offering, realtime PowerTrack includes tools for dynamic filtering, consistent connection, data recovery and data compliance management. This technology, paired with operational monitoring, guaranteed support and integration services allows businesses to start with a strong foundation to serve their own customers.
These features include:
  1. Dynamic rule updates while connected to the stream. There is no need to disconnect your stream while you update your stream’s ruleset.
  2. Support for multiple connections to each stream.
  3. Ability to automatically recover data that is missed during brief disconnects when you reconnect within 5 minutes with Backfill.
  4. Availability of Replay streams that enable customers to recover activities missed due to technical hiccups from a rolling 5-day window of historical data.
  5. Availability of additional streams for testing and development.
  6. Status dashboard to communicate with customers about any operational issues.

Qualified customers will be assigned a representative from the Twitter team to help with your package options and contract. You will also be set up for a two-week Trial account of the PowerTrack APIs. During the Trial you will have full access to the PowerTrack APIs for client-side testing and development. At the end of the trial, your Transition Representative will review your account pricing options and next steps for a full account.

Every Gnip account includes a dashboard for monitoring realtime data volumes, connection history, and configuring output formats. The dashboard also provides information on daily and monthly data consumption, controls for managing dashboard access, and sample commands for streaming data and using the Rules API. See this video and documentation for a tour of the dashboard. While the dashboard provides a management and monitoring tool for your realtime stream, all data connections are managed completely through the PowerTrack API.

Similarly, the Historical PowerTrack API is managed completely through its RESTful API.

No. Gnip provides a single streaming connection, through which all of the data for your ruleset is delivered together. Each activity that’s delivered will contain metadata that indicates the rule or rules that matched it so that the data can be easily sorted after receipt.
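Sorting delivered activities by their matched rules can be sketched as follows. This is a minimal sketch, assuming the Activity Stream format where PowerTrack places a "matching_rules" array (each entry carrying the rule's value and optional tag) under the "gnip" key; the activity content itself is hypothetical.

```python
import json

# Hypothetical activity fragment; "gnip.matching_rules" is the metadata
# PowerTrack adds to each delivered activity (Activity Stream format).
activity = json.loads("""
{
  "id": "tag:search.twitter.com,2005:123456",
  "body": "Example Tweet text",
  "gnip": {
    "matching_rules": [
      {"value": "snow boulder", "tag": "customer-17"},
      {"value": "weather has:geo", "tag": "customer-42"}
    ]
  }
}
""")

def matched_tags(activity):
    """Return the set of rule tags that matched this activity."""
    rules = activity.get("gnip", {}).get("matching_rules", [])
    return {r["tag"] for r in rules if r.get("tag")}

# Route the activity to each customer/project bucket whose tag matched.
print(sorted(matched_tags(activity)))  # -> ['customer-17', 'customer-42']
```

A Tweet matching several rules appears once on the stream, so sorting by tag after receipt is how a single connection serves multiple downstream consumers.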

PowerTrack provides a rich set of over fifty PowerTrack Operators that are used to create filtering rules. These Operators enable users to create filters based on a wide variety of metadata including Tweet keywords and phrases, user and conversation details, hashtags, user and Tweet geography, as well as URL and media details.

PowerTrack streams can support many thousands of these individual rules, and deliver the combined set of matching activities through a single data stream in realtime.

The set of PowerTrack rules used to filter a customer’s stream is highly flexible. If a customer needs to add a new filtering rule to capture a different type of content, or remove an existing rule, their application can send a request to the PowerTrack Rules API. This allows customers to provide data for many of their own customers at scale, while supporting distinct filtering requirements for each of their customers.
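A rule-update request can be sketched as below. The endpoint URL is a placeholder (use the Rules API URL from your dashboard), and the body shape — a top-level "rules" array of value/tag objects — follows Gnip's documented JSON format; treat both as assumptions to verify against your own docs.

```python
import json

# Placeholder endpoint -- substitute the Rules API URL from your dashboard.
RULES_URL = "https://api.gnip.com/accounts/{account}/publishers/twitter/streams/track/prod/rules.json"

def build_rules_payload(rules):
    """Serialize (value, tag) pairs into a Rules API request body."""
    return json.dumps(
        {"rules": [{"value": value, "tag": tag} for value, tag in rules]}
    )

payload = build_rules_payload([
    ("snow has:geo", "ski-report"),
    ("coffee (latte OR espresso)", "beverage-study"),
])
# Adding rules is a POST of `payload` to RULES_URL, e.g. with the
# third-party `requests` library:
#   requests.post(RULES_URL, auth=(user, pw), data=payload)
# Rule removal uses the same body shape against the API's delete
# mechanism (check your documentation for the exact form).
```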

The two primary sources of geographic metadata in Twitter data are:

Geo-tagged Tweets: Tweets that are geo-tagged explicitly by the user with either an exact location or a Twitter Place. PowerTrack provides bounding_box, point_radius, place, place_contains, and has:geo Operators to match on Tweet geo metadata.

Profile Location: Twitter users have the option to indicate their location as part of their Twitter Profile. PowerTrack provides bio_location and bio_location_contains Operators for matching on the Profile Location. PowerTrack also has a Profile Geo enrichment that geocodes and normalizes users’ Profile Location, along with a rich set of Profile Geo Operators.

We have several articles which discuss how to geo reference Twitter data with PowerTrack:
  1. Geo Introduction
  2. Filtering Twitter by Location
  3. Geo Referencing with Twitter
  4. Visualizing Geo Data with Twitter

All data delivered from PowerTrack products are marked-up in JSON using the UTF-8 character set. There are two output formats available from PowerTrack:
  1. “Original” format: Same format as delivered from Twitter’s Public API. When PowerTrack delivers this format, the Tweet payloads are passed through without altering the structure, and no data enrichments are added in. See Twitter's documentation for more details.
  2. “Activity Stream” format: Gnip uses the open Activity Streams format to provide a normalized structure that is designed to be used with various types of social data. See Gnip’s support article Activity Streams Intro for more details.
Also, see Sample Payloads for a side-by-side comparison of a Tweet in the two formats.

There are a variety of things that can happen to Tweets and User accounts that impact how they are displayed on the platform. The state of a User or Status can change at any time due to these actions, and this impacts how consumers of Twitter data are expected to treat the availability and privacy of all associated content. When these actions happen, a corresponding compliance message is sent that indicates that the state of a Status or User has changed. We have developed tools to help our customers continually keep their data “in compliance”, honoring both the privacy and intent of Twitter users and ensuring the continued growth of the platform. Also review our Compliance API.

No, Compliance events are not part of the payload that is delivered in PowerTrack streams. However, the Gnip Compliance API was built to enable customers to receive notifications of Compliance events associated with the data they have consumed.

No, these types of messages are private and not available via PowerTrack. Please see our REST API documentation for more details on how to access Direct Messages at https://dev.twitter.com/rest/public.

Realtime PowerTrack  

Realtime streams of data are initiated by sending an HTTP GET command to your custom https://stream.gnip.com URL. HTTP streaming connections are requested with HTTP headers that indicate a ‘keep-alive’ connection. More information on realtime streaming is available HERE. Also, see this video for a walkthrough of adding rules to a stream and streaming data.
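A minimal streaming client can be sketched as below, assuming the third-party `requests` library. The URL and credentials are placeholders from your dashboard; each activity is assumed to arrive as one newline-delimited JSON object, with blank lines serving as heartbeats.

```python
import json

def handle_line(raw_line):
    """Decode one line from the stream; return None for heartbeats."""
    line = raw_line.strip()
    if not line:          # keep-alive heartbeat: just a newline
        return None
    return json.loads(line)

def stream(url, auth):
    import requests  # pip install requests
    # stream=True keeps the HTTP connection open; the read timeout
    # (second tuple element) fires if data and heartbeats stop arriving,
    # which is the cue to trigger reconnect logic.
    with requests.get(url, auth=auth, stream=True, timeout=(10, 30),
                      headers={"Connection": "keep-alive"}) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines():
            activity = handle_line(raw)
            if activity is not None:
                yield activity

# Line handling demonstrated without a network connection:
print(handle_line(b'{"id": "1"}'))  # -> {'id': '1'}
print(handle_line(b"\n"))           # -> None (heartbeat)
```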

Click HERE for coding examples and sample client libraries. Note that the client libraries are not formally supported and are only presented as examples.

Given the potentially high volumes of Twitter data delivered in a stream, it is highly recommended that incoming data be processed asynchronously. This means that the code that ‘hosts’ the client side of the stream simply inserts incoming Tweets into a (FIFO) queue or similar memory structure, while a separate process/thread consumes Tweets from that queue and does the ‘heavy lifting’ of parsing and preparing the data for storage. With this design you can implement a process that will bend but not break if incoming data volumes change dramatically.
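The two-stage design above can be sketched with Python's standard-library queue and threading modules; the queue size and sample data are illustrative.

```python
import json
import queue
import threading

tweet_queue = queue.Queue(maxsize=100_000)  # bounded FIFO buffer
_SENTINEL = object()

def reader(lines):
    """Stage 1: called by the streaming client; must return quickly."""
    for raw in lines:
        tweet_queue.put(raw)
    tweet_queue.put(_SENTINEL)  # signal end-of-stream to the worker

def worker(results):
    """Stage 2: parse and 'store' (here: append) each Tweet."""
    while True:
        item = tweet_queue.get()
        if item is _SENTINEL:
            break
        results.append(json.loads(item))
        tweet_queue.task_done()

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()
reader(['{"id": 1}', '{"id": 2}'])  # stand-in for the live stream
t.join()
print(len(results))  # -> 2
```

Bounding the queue gives backpressure: if parsing falls behind during a volume spike, the buffer absorbs the burst rather than stalling the stream read.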

The vast majority of realtime PowerTrack users manage multiple customers, projects, and campaigns within a single realtime stream by using PowerTrack rule ‘tags’. Rule tags have no special meaning; they are simply treated as opaque strings carried along with the rule. They will be included in the “matching rules” metadata in activities returned. Tags provide an easy way to create logical groupings of PowerTrack rules. For example, you may generate a unique ID for a specific rule as its tag, and allow your application to reference that ID within activities it processes to associate a result with specific customers, campaigns, categories, or other related groups.

PowerTrack streams support multiple connections to a single endpoint. Having multiple connections enables customers to build redundant data consumer clients, ideally on different networks. While PowerTrack streams default to a single connection, many customers prefer to have two connections per PowerTrack stream to ensure continuous delivery. If multiple connections are made to a single endpoint, and/or multiple streams exist with common rules, a given Tweet will be received multiple times. Note that for accounting purposes, the Tweet will be counted once.

When streaming data, the goal is to stay connected for as long as possible, recognizing that disconnects will occur. PowerTrack streams provide a 15-second heartbeat (in the form of a new-line character) that enables client applications to detect disconnects. When fresh data and the heartbeat stop arriving, reconnection logic should be triggered. In most software languages this can be easily implemented by setting a data read timeout.

Any time you disconnect from the stream, you are potentially missing data that would have been sent if connected. However, Gnip provides multiple ways to mitigate these disconnects and recover data when they occur.

Gnip has a range of tools available for recovering missed or historical Tweets.
  1. Redundant Streams: With multiple connections, consume the stream from multiple servers to prevent missed data when one is disconnected.
  2. Replay: Recover data from within the last 5 days using a separate stream.
  3. Backfill: Reconnect within 5 minutes and start from where you left off.
  4. Historical PowerTrack: Recover data from the entire Twitter archive.

PowerTrack clients ingest data in a wide variety of ways, the most common methods being writing to a relational database, inserting Tweets into a NoSQL datastore, or adding Tweets to a queuing system. Some customers write incoming data to flat-files to be ingested by legacy data management systems.

Assuming the backend infrastructure already exists, writing data to a NoSQL datastore or a queuing system is relatively straightforward and does not require a lot of code. For example, HERE is an example Java class that manages writing Tweets to a MongoDB datastore.

Writing to a relational database can be a bit more involved since you need to define the database schema to store some data. It is likely that you have already crossed that bridge (and if you have not, check out this series of articles about designing schemas for storing Twitter data), and PowerTrack integration will focus on establishing a database connection (local or cloud-based), parsing the Tweet JSON, building your SQL statement and executing it.
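The parse-insert-execute flow can be sketched with Python's standard-library sqlite3 as a stand-in for a production database; the three-column schema and the Activity Stream field names used here ("id", "postedTime", "body") are illustrative — real schemas typically normalize users, hashtags, and URLs into separate tables.

```python
import json
import sqlite3

# Toy schema; replace with your production database and full schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tweets (
        id        TEXT PRIMARY KEY,
        posted_at TEXT,
        body      TEXT
    )
""")

def store(activity_json):
    """Parse one activity and insert it, ignoring duplicates."""
    a = json.loads(activity_json)
    conn.execute(
        "INSERT OR IGNORE INTO tweets (id, posted_at, body) VALUES (?, ?, ?)",
        (a["id"], a["postedTime"], a["body"]),
    )
    conn.commit()

store('{"id": "tag:search.twitter.com,2005:1", '
      '"postedTime": "2014-01-01T00:00:00.000Z", "body": "hello"}')
```

Parameterized statements (the `?` placeholders) matter here: Tweet bodies routinely contain quotes and other characters that would break naive string-built SQL.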

Yes, there are a variety of realtime code examples available on the Gnip Support site, including some third-party libraries. There are also examples for integrating with the PowerTrack Rules API HERE.

If you are going to integrate using Java, another option is Twitter's Hosebird Client library. The Enterprise Stream example can be used with Gnip streams.

Search API  

The Search API is a RESTful API that provides low-latency, full-fidelity, query-based access to the last 30 days of the Twitter archive. The Search API supports two types of requests – Search Requests and Estimate (Tweet Count) Requests. Each API request includes a query, using PowerTrack syntax.

The Search API operates with a single rule/filter at a time. The rule can be up to 1,024 characters in length, with up to 30 positive clauses (things you want to match on) and up to 50 negative clauses (things you do not want to match on).

The Search API implements a pagination mechanism to return potentially large volumes of data in easily consumable pieces. Search API request parameters include a ‘maxResults’ parameter that determines the maximum number of Tweets to include in each API response. The default for this parameter is 100, and it has a supported range of 10-500. If the Search API query has more than maxResults to return, a ‘next’ token is provided with each response until all the data is returned. This 'next' token is added as a parameter to the subsequent call, along with your original query.

Beyond the ‘maxResults’ parameter already mentioned, the other request parameters are the rule/filter you are supplying as the ‘query’ parameter, and the ‘fromDate’ and ‘toDate’ parameters that define the search period of interest.
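The request parameters and the 'next'-token loop can be sketched as below. The body field names follow the parameters described above; `fetch_page` is a stand-in for the HTTPS request to your Search API endpoint, and the two-page response used to demonstrate the control flow is fabricated.

```python
def build_request(query, from_date, to_date, max_results=500, next_token=None):
    """Assemble one Search API request body."""
    body = {
        "query": query,
        "fromDate": from_date,   # assumed format: YYYYMMDDhhmm
        "toDate": to_date,
        "maxResults": max_results,
    }
    if next_token:
        body["next"] = next_token
    return body

def paginate(fetch_page, query, from_date, to_date):
    """Yield every Tweet for the query, following 'next' tokens."""
    next_token = None
    while True:
        resp = fetch_page(build_request(query, from_date, to_date,
                                        next_token=next_token))
        yield from resp.get("results", [])
        next_token = resp.get("next")
        if not next_token:
            break

# Fake two-page response to show the control flow (no network):
pages = [{"results": [{"id": 1}], "next": "abc"},
         {"results": [{"id": 2}]}]
tweets = list(paginate(lambda body: pages.pop(0),
                       "snow has:geo", "201501010000", "201501310000"))
print(len(tweets))  # -> 2
```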

Yes, the Search API provides a ‘counts’ endpoint that returns an estimate of the Tweets associated with your rule/filter. This ‘counts’ endpoint provides an array of matches on either a minute-by-minute, hourly or daily interval for the period of interest.

These counts are considered an estimate because they do not account for deleted data. For this reason, the total counts provide an upper-bound on the number of Tweets that will be available when and if the data is requested.
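Interpreting a counts response can be sketched as below: summing the per-interval buckets yields the upper bound described above. The response shape — a "results" array of objects with a time period and a count — is an assumption based on the bucketed intervals, and the numbers are fabricated.

```python
# Hypothetical daily-granularity counts response.
counts_response = {
    "results": [
        {"timePeriod": "201501010000", "count": 120},
        {"timePeriod": "201501020000", "count": 95},
        {"timePeriod": "201501030000", "count": 130},
    ]
}

# Upper bound on Tweets retrievable for this rule over the period;
# deletions mean the actual data request may return fewer.
upper_bound = sum(bucket["count"] for bucket in counts_response["results"])
print(upper_bound)  # -> 345
```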

Note that the counts endpoint is useful in a number of use-cases. Some customers use it to ‘look before they leap’ when adding new rules to their realtime stream. Other use-cases focus more on data velocities and volumes than on the data itself.

Tweets are available through the Search API approximately 30 seconds after they are posted on Twitter.

The Search API provides Tweets from the last 30 days.

No; a subset of realtime/Historical PowerTrack Operators is not available with the Search API. In short, all of the substring (‘contains’), numeric ‘count’, and sample Operators are not supported. For a complete list, see HERE.

Both ‘original’ and ‘Activity Stream’ JSON formats are available with the Search API. The type of output format is set at a global Account level, and this cannot be specified on a per-request basis.

It depends. In general, when consuming data from within the past 30 days, the Search API will likely be the preferred product. However, for very high volume queries Historical PowerTrack may be a better fit. For example, say a query is associated with 1,000,000 Tweets over the last 30 days. With a maxResults of 500 Tweets per API response, it will take approximately 2,000 requests to the Search API to harvest all of the data.

With Historical PowerTrack all of this data could be retrieved in a single Job that would likely complete faster than making that many calls to the Search API.

Another consideration is that Historical PowerTrack can support up to 1,000 rules/filters per Job. So if you have a large set of rules/filters to search with, Historical PowerTrack may be the more efficient option.

Historical PowerTrack  

The Historical PowerTrack (HPT) API provides a data service used to filter and download data from the Twitter archive, starting with the first Tweet in March 2006. Historical data is generated through a series of steps, and the API is used to navigate through these steps:

  1. Create Job.
  2. Review Job estimate.
  3. Accept or reject Job.
  4. Monitor accepted Job progress.
See HERE for more Historical PowerTrack documentation.

You create a Historical PowerTrack Job by creating a JSON object (see example below) with several attributes, including the dates that define your search period and a “rules” array that defines up to 1,000 rules. The JSON Job definition is submitted by making a POST request to your Historical PowerTrack endpoint. See HERE for more API details on creating a Job.
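A Job definition can be sketched as below. The specific attribute names ("publisher", "streamType", "dataFormat", "fromDate", "toDate", "title", "rules") are assumptions based on Gnip's Job schema, and the endpoint URL is a placeholder; verify both against the HPT documentation for your account.

```python
import json

job = {
    "publisher": "twitter",
    "streamType": "track",
    "dataFormat": "activity-streams",   # or the 'original' format
    "fromDate": "201401010000",         # search period, YYYYMMDDhhmm
    "toDate": "201401020000",
    "title": "example-job",
    "rules": [
        {"value": "snow has:geo", "tag": "weather"},
        {"value": "coffee (latte OR espresso)", "tag": "beverages"},
    ],
}
body = json.dumps(job)
# Submit with a POST to your HPT endpoint, e.g. with requests:
#   requests.post(JOBS_URL, auth=(user, pw), data=body)
```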

Every Historical PowerTrack Job can have up to 1,000 rules/filters associated with it. These rules are written using the same rule-based filtering system as realtime PowerTrack. As with realtime PowerTrack, using rule tags is considered a best practice.

Yes. After a Job is created, an estimate is generated of the number of Tweets associated with your Job, along with the expected download payload size and how long the Job should take to complete.

Based on these estimates you have the option to either accept or reject the Job.

The Historical PowerTrack data service generates a time-series of data files. Each file is gzip-compressed and contains Tweets in JSON format (either ‘original’ or Activity Stream format). Each file covers a 10-minute segment of the overall Job: a 1-hour Job will contain up to 6 files, and a year-long Job may contain over 52,000 files.

When a Job is complete, a Data URL is provided. This URL returns a list of download links, one for each file generated.
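Consuming a completed Job can be sketched as below, assuming each 10-minute file is gzip-compressed, newline-delimited JSON as described above. `download_file` is a stand-in for the HTTPS download of one link from the Data URL's list, and the file contents here are fabricated.

```python
import gzip
import io
import json

def iter_tweets(download_file, url_list):
    """Yield Tweets from each gzipped, newline-delimited JSON file."""
    for url in url_list:
        with gzip.open(io.BytesIO(download_file(url)), mode="rt") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    yield json.loads(line)

# Fake one downloaded file to show the flow (no network):
fake_file = gzip.compress(b'{"id": 1}\n{"id": 2}\n')
tweets = list(iter_tweets(lambda url: fake_file,
                          ["https://example.com/file1.gz"]))
print(len(tweets))  # -> 2
```

Since a long Job can produce tens of thousands of files, downloading links concurrently (with a bounded worker pool) is usually worthwhile; the iteration logic stays the same.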

See this article for more information on downloading Historical PowerTrack files.

Data is available for 15 days after the Job is accepted. After 15 days the data expires and is deleted.

Yes; at the time the Job’s data is compiled and the file set generated, the data is compliant. Deleted Tweets are not available through Historical PowerTrack (or any other PowerTrack product).

PowerTrack Rules  

PowerTrack provides a rich set of over fifty Operators that are used to match on different Tweet attributes such as hashtags, links, and mentions. See our support site documentation for the complete list of Operators, complete with examples.

On a fundamental level, it helps to think about Gnip PowerTrack rules as independent, or atomic, filters that are applied to the Twitter firehose simultaneously in realtime. These rules are completely independent of each other.

PowerTrack’s products and filtering feature set evolved from Gnip’s core mission to build enterprise-ready data streams for their customers to build on top of.

In PowerTrack, each rule is evaluated independently, and exclusions must be included in each rule explicitly in order for them to apply. PowerTrack does not support a stream-wide exclusion list.

A single PowerTrack rule can contain up to 30 positive clauses, up to 50 negative clauses, and up to 1,024 characters. Your PowerTrack stream supports up to 250,000 of these rules, which can be added in batches of up to 1 MB (in JSON format).

Yes, PowerTrack rulesets used by a customer’s stream can be dynamically updated while connected to the stream. If a customer needs to add a new filtering rule to capture a different type of content, or remove an existing rule, their application can send a request to the PowerTrack API to make it happen. When that request is sent, the filtering rules are automatically modified and the changes simply take effect in the data stream with no need to reconnect. This allows customers to provide data for many customers at scale, while supporting distinct filtering requirements for each of those customers.

Rule tags can be used to group rules into multiple logical sets. Tags are used to segregate rules (and the Tweet data they match on) into different client, campaign, and project groups.


Note that tags cannot be updated on an existing rule, but can only be included when a rule is created. In order to “update” a tag, you need to first remove the rule, then add it again with the updated tag. The best solution is to simply use a unique identifier as your tag, which your system can associate with various other data points within your own app, all without having to change anything in the PowerTrack rule set.
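The remove-then-re-add sequence can be sketched as below. `delete_rules` and `add_rules` stand in for requests to the Rules API (the exact deletion mechanism is an assumption — check your documentation), and the rule value and tags are illustrative.

```python
import json

def retag(rule_value, new_tag, delete_rules, add_rules):
    """'Update' a rule's tag: remove the rule, then re-add it tagged."""
    delete_rules(json.dumps({"rules": [{"value": rule_value}]}))
    add_rules(json.dumps({"rules": [{"value": rule_value,
                                     "tag": new_tag}]}))

# Record the calls instead of hitting the network, to show the order:
calls = []
retag("snow has:geo", "campaign-2015",
      delete_rules=lambda body: calls.append(("delete", body)),
      add_rules=lambda body: calls.append(("add", body)))
print([op for op, _ in calls])  # -> ['delete', 'add']
```

Note the brief gap between the delete and the re-add, during which the rule matches nothing; using a stable unique-ID tag from the start avoids ever needing this sequence.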

PowerTrack filtering does not support wildcards. The contains: Operator allows a substring match for activities that have the given substring in the body, regardless of tokenization. In other words, this does a pure substring match and does not consider word boundaries. Use double quotes to match substrings that contain whitespace or punctuation.

Gnip returns the original (short) URL, the final expanded URL, and the status code associated with the HEAD requests used to resolve the URL. Other metadata like the page’s title and content are not returned.

Twitter applies quality (spam) filtering to the firehose, including filtering entire accounts from the data that is sent through it. For these accounts, Gnip is never sent the Tweets being posted. To see whether an account is being spam filtered, go to search.twitter.com in your browser and enter from:username in the search form (substituting the user account you’d like to check for ‘username’). If the result comes back as either “No Tweet results for…”, or only very old Tweets despite the user’s account having recent Tweets, Twitter is filtering that account from the Search and Streaming APIs, as well as the firehose.

Gnip PowerTrack does not provide any sentiment-based data enrichments. Gnip has always focused on developing the enterprise-level data products that enable our customers to build such tools. We value our ecosystem partners in this area and recognize the many nuances involved in effective sentiment analysis.

If sentiment analysis is key to your use-case, and you decide not to generate that analysis in-house, reach out to your Gnip representative for more information on Gnip partners that can help.

PowerTrack filtering does not support Regular Expressions (RegEx). PowerTrack does provide the contains: Operator for substring matches. For some RegEx use-cases the proximity: Operator can be helpful.

Anything Else  

No. Currently, we support delivery of data via a persistent Streaming HTTP connection to the API.


To get started, we suggest you check out the articles and documentation on this site. In particular, the following may be most helpful to introduce you to our products: