PowerTrack,Firehose, and Replay streams utilize Streaming HTTP protocol to deliver data through an open, streaming API connection. Rather than delivering data in batches through repeated requests by your client app, as might be expected from a REST API, a single connection is opened between your app and the API, with new results being sent through that connection whenever new matches occur. This results in a low-latency delivery mechanism that can support very high throughput. 

The overall interaction between your app and Gnip’s API for these streams is described as follows:

Step 1: Configure your stream

To configure your stream, you simply need to select a data format, and if you are using a PowerTrack stream, set up some rules to define what content should be delivered through your stream.

Gnip’s streams will deliver our normalized Activity Streams format by default, but in some cases the native format as provided by the publisher may be available as well.

For details on building PowerTrack rules, and adding them to your stream via the PowerTrack Rules API, see here.

After you have configured your stream, you are ready to connect.

Step 2: Connect to the API

To open the data stream to have activities delivered, you need to send a connection request to the API. In the streaming model, this connection opens up the pipeline for data to be delivered to you as it happens, and will exist for an indefinite period of time. See the documentation for the specific APIs for information on establishing the connection.

Step 3: Consume the data as it’s delivered

Once the connection is established, Gnip’s stream sends new activities through the open connection as they happen, and your client app should read the data off the line as it is received. Your client app will need to recognize and handle various types of messages, which are described in our documentation below.

Step 4: When disconnected, reconnect to the API

Inevitably, at some point the connection to the stream will close – a disconnection. With realtime streams, it is important to understand that you may be missing data any time you are disconnected from the stream. Whenever a disconnection occurs, your client app must restart the process by establishing a new connection.

Additionally, to ensure that you do not miss any data, you may need to utilize a Redundant Connection, Backfill, or a Replay stream to mitigate or recover data from disconnections from the stream.


Consuming The Stream 

After you establish the connection to your stream, you will begin receiving a stream of data. The body of the response consists of a series of carriage-return (\r\n) delimited JSON-endocded activities, system messages, and blank lines.

Your client should use the \r\n character to break activities apart as they are read in from the stream. Keep in mind that your app should not wait for the end of the “document” to begin processing the data. In effect, the document is never-ending, and your client will need to read the data off the line as it arrives.

JSON Data

The individual data streamed by this API are JSON encoded, and fall into the following types:

  • Activities: Social data activities (e.g. “a Tweet”)
  • Keep-alive Signals: Carriage returns to prevent your connection from timing out
  • System messages: E.g. notification of a force disconnect (although note that the actual disconnection is accomplished via normal HTTP protocols, rather than through the message itself).

Note that the individual fields of JSON objects are not ordered, and not all fields will be present in all circumstances. Similarly, separate activities are not delivered in sorted order, and duplicate messages may be encountered. Keep in mind that over time, new message types may be added and sent through the stream.

Thus, your client must tolerate:

  • Fields appearing in any order
  • Unexpected or missing fields
  • Non-sorted activities
  • Duplicate messages
  • New arbitrary message types coming down the stream at any time

Activities

JSON Activities representing the social data activities for the given data source are provided in Gnip’s Activity Streams format, although in some cases, you may also choose the data source’s native data format. See here for an introduction to Activity Streams.

An activity will be contained on a single line may contain linefeed \n characters, but will not contain carriage returns.

For details on the format of a particular data source, see the links below:

System Messages

The Gnip stream may also contain system messages. Provided below is the basic format of these messages, along with some examples. Please note that the messages delivered could change, with new message being introduced. Client applications need to be tolerant of changing system message payloads.

Note that “sent” timestamps are in the YYYY-MM-DDTHH:mm:ssZZ format and in the UTC timezone.

Message Format:

{"<message type>":{"message":"<the message>","sent":"<date time sent>"}}

System ​​Message Types: There are currently 3 message types, each with a number of different messages. However, this is subject to change without notice.

  • Info - “Replay Request Complete” is sent when a Replay stream request finishes. See below.
  • Warn - E.g. if you’re connected to a PowerTrack stream which has zero rules.
  • Error - These may be sent when certain issues occur, e.g. “Error while streaming data”, or for force disconnects.

Note that error messages indicating a force disconnect for a full buffer may never get to your client, if the backup which caused the force disconnect prevents it from getting through. Accordingly, your app should not depend on these messages to initiate a reconnect.

Examples:

{"error":{"message":"Forced Disconnect: Too many connections. (Allowed Connections = 2)","sent":"2013-01-11T18:12:52+00:00"}}
{"error":{"message":"Invalid date format for query parameter 'fromDate'. Expected format is 'yyyyMMddHHmm'. For example, '201101012315' for January 1st, 11:15 pm 2011 UTC.\n\n","sent":"2013-01-11T17:04:13+00:00"}}
{"error":{"message":"Force closing connection to <description of connection> because it reached the maximum allowed backup (buffer size is <size>).","sent":"2013-01-11T17:04:13+00:00"}}

Keep-alive Signals

At least every 30 seconds, the stream will send a keep-alive signal, or heartbeat in the form of an \r\n carriage return through the open connection to prevent your client from timing out. Your client application should be tolerant of the \r\n characters in the stream.

If your client properly implements a read timeout on your HTTP library as described here, your app will be able to rely on the HTTP protocol and your HTTP library to throw an event if no data is read within this period, and you will not need to explicitly monitor for the \r\n character.

This event will typically be an exception being thrown or some other event depending on the HTTP library used. It is highly recommended to wrap your HTTP methods with error/event handlers to detect these timeouts. On timeout, your application should attempt to reconnect.

Gzip compression

Premium streams are delivered in compressed Gzip format. Your client will need to decompress the data as it’s read off the line. If you consume a low-volume stream, some libraries (e.g. Java’s GZIPInputStream) handle decompression of incoming data poorly, and will need to be overridden in order to decompress the activities as they are received, without waiting for a set threshold of data to be received. As an example of this, see our Java code here.

Chunked Encoding

Streaming API connections will be encoded using chunked transfer encoding, as indicated by the presence of a Transfer-Encoding: chunked HTTP header in the response. Because most HTTP libraries deal with chunked transfer encoding transparently, this document will assume that your code has access to the reassembled HTTP stream and does not have to deal with this encoding.

If this is not the case, note that activities and other streamed messages will not necessarily fall on HTTP chunk boundaries – be sure to use the delimiter defined above to determine activity boundaries when reassembling the stream.

Backup Buffering 

The stream will send data to you as fast as it arrives from the source firehose, which can result in high volumes in many cases. If Gnip’s system cannot write new data to the stream right away (for example if your client is not reading fast enough, or there is a network bottleneck, etc), it will buffer the content on its end to allow your client to catch up. However, when this buffer is full, a force disconnect will be initiated to drop the connection, and the buffered activities will be dropped and not resent. See below for more details.

One way to identify times where your app is falling behind is to compare the timestamp of the activities being received with the current time, and track this over time.

Although stream backups cannot ever be completely eliminated due to potential latency and hiccups over the public internet, they can be largely eliminated through proper configuration of your app. To minimize the occurrence of backups:

  • Ensure that your client is reading the stream fast enough. Typically you should not do any real processing work as you read the stream. Read the stream and hand the activity to another thread/process/data store to do your processing asynchronously.
  • Ensure that your data center has inbound bandwidth sufficient to accomodate large sustained data volumes as well as significantly larger spikes (e.g. 3-4x normal volume). For filtered streams like PowerTrack, the volume and corresponding bandwidth required on your end are wholly dependent on what you are tracking, and how many activities those filters return.

A simple way to test the speed of your app is to connect to the stream using your app, while also connecting in parallel via a redundant connection using curl. The curl connection will provide insight into the type of performance that would occur with your app code taken out of the chain, and allowing you to eliminate variables as you debug.

Volume Monitoring

Consider monitoring your stream data volumes for unexpected deviations. A data volume decrease may be symptomatic of a different issue than a stream disconnection. In such a situation, a stream would still be receiving the keep-alive signal and probably some new activity data. However, a significantly decreased number of activities should lead you to investigate whether there is anything causing the decrease in inbound data volume to your application or network, check status.gnip.com for any related notices, or to reach out to the Gnip support team to see if there are any upstream issues causing the data volume drop.

To create such monitoring, you could track the number of new activities you expect to see in a set amount of time. If a stream’s data volume falls far enough below the specified threshold, and does not recover within a set period of time, then alerts and notifications should be initiated. You may also want to monitor for a large increase in data volume, particularly if you are in the process of modifying rules in a PowerTrack stream, or if an event occurs that produces a spike in activities.

Because each of our customers have very different expectations around what “normal” data volume should be for their streams, we do not have a general recommendation for a specific percentage decrease/increase or period of time. We recommend that each of our customers determine their own metrics to use for this type of monitoring.

Furthermore, for PowerTrack streams, in order to keep a close watch and tight control over the volume of data your app is pulling in, your app should monitor the volume being consumed on a rule-by-rule basis in realtime or near-realtime. Additionally, we encourage you to implement measures within your app that will alert your team if the volume passes a pre-set threshold, and to possibly introduce other measures such as automated deletion of rules that are bringing in too much data, or disconnecting from the stream completely in extreme circumstances. These steps will help prevent the “runaway volume” scenario that could result in large data bills.


Disconnections 

For Gnip’s purposes, stream disconnections are grouped into two categories – client disconnections and forced disconnections.

Client Disconnects

Client disconnects occur when your application independently terminates the connection to the data stream, whether because your code actively closes the connection, or where network settings terminate the connection.

Common fallacies in client code that can create this behavior include:

  • Closing the connection, rather than allowing it to remain open indefinitely
  • Timing out the connection even though data (activities, system messages, or keep-alive signals) are still being sent through the conneciton
  • Decompressing the compressed stream using incorrect evaluation of chunk sizes
  • Network or firewall restrictions in your infrastructure can close your connection to the data stream (e.g. firewall session limits) – be sure to check for anything along these lines that might sever existing connections from time to time.

Force Disconnects 

Forced Disconnections occur when our system actively disconnects your app from the stream. There are 3 different types of forced disconnects.

  • Full Buffer – your app is not reading the data fast enough, or a network bottleneck is slowing data flow.
  • Too many connections – your app established too many simultaneous connections to the data stream. When this occurs, Gnip will wait 1 minute, and then disconnect the most recently established connection if the limit is still being exceeded.
  • Server Maintenance – the Gnip team deployed a change or update to the system servers. This will be accompanied by a “Gnip is closing for operational reasons” message.

Forced disconnects close your connection by sending a zero byte chunk in accordance with standard HTTP chunked-encoding practice. See Section 9 of this page for reference: http://www.httpwatch.com/httpgallery/chunked. Note that in the case of a full buffer, our app sends out the zero byte chunk, but your app may never receive it due to whatever is causing the data backup in the internet or within your app.

To account for this situation, be sure that your app is set to timeout/reconnect if it does not see any data (either an activity or our keep alive signal) for more than 30 seconds. They are registered as a “force disconnect” on your dashboard, in contrast to a “client disconnect” which is not initiated by Gnip’s servers.

To minimize forced disconnects for a full buffer, see Backup Buffering above.


Reconnecting 

When an established connection drops, attempt to reconnect immediately. If the initial reconnect attempt is unsuccessful, your client should continue attempting to reconnect using an exponential back-off pattern until it successfully reconnects.

Regardless of how your client gets disconnected, you should configure your app to reconnect immediately. If your first reconnection attempt is unsuccessful, we recommend that your app implement an exponential back-off pattern in subsequent reconnection attempts (e.g. wait 1 second, then 2 seconds, then 4, 8, 16, etc), with some reasonable upper limit. If this upper limit is reached, you should configure your client to notify your team so that you can investigate further.

Be sure that your app can handle all of the following scenarios:

  • Connection is closed by something on your end (client disconnect)
  • Connection is closed by Gnip with a zero-byte chunk (forced disconnect)
  • Your app has received zero data through the connection – i.e. no new activities AND no keep-alive carriage return signals – for more than 30 seconds. Your app must allow the connection to time out in this situation – note that by default, some clients persistent connections can remain open even without new data unless your code tells it to close based on an absence of the above signals. Note that unlike the third scenario above, if you are still receiving the keep-alive signal, but no new data, your app should remain connected – there may be an issue on the publisher’s side preventing data from being received by Gnip, or there may just not be any new activities that match your filters. Remaining connected as long as you are at least receiving the keep-alive signal ensures that you will receive new matching activities when they are received by Gnip. This is typically accomplished by setting the read timeout on your client to something beyond 30 seconds.

Missing Data from Disconnections

Any time you disconnect from the stream, you are potentially missing data that would be sent through at that time. However, Gnip provides multiple ways to mitigate these disconnects and recover data when they occur. See below for more information.

  • Redundant Stream - Consume the stream from multiple servers to prevent missed data when one is disconnected
  • Backfill - Reconnect within 5 minutes and start from where you left off
  • Replay - Recover data from within the last 5 days using a separate stream

Special Considerations for Replay Streams 

Replay allows Gnip customers to recover activities missed due to technical hiccups from a rolling window of historical data, and uses the Streaming API for data delivery. Full documentation for the Replay API can be found here.

However, your client must take into account the following requirements and special considerations when consuming data from a Replay stream.

Reconnections

Note that If you are disconnected from the stream during a Replay request, your app should not reconnect to the same URL used to initiate the request without altering the fromDate. Doing so will restart the request from the beginning and repeat the portion of time covered prior to disconnecting. Instead, upon being disconnected prior to the completion of a request, your client must adjust the fromDate and toDate in the URL to begin at the minute of the last activity you collected prior to the disconnection, to pick up from the same minute you left off, minimizing the amount of overlap.

Duplicates

We will not omit activities you may already have from a prior request, or your normal Power Track stream when delivering the results of a Replay request, and your app should deduplicate accordingly. Note, however, that for billing purposes, we deduplicate your activity counts across PowerTrack Replay requests that occur on the same day (based on UTC time) to ensure you are only billed once for each unique activity delivered via Replay. Activities delivered via the realtime Power Track stream are counted separately from those delivered via Replay.


Backfill 

Backfill is used to automatically recover data that is missed during brief disconnects when you reconnect within 5 minutes, and is available on PowerTrack and Firehose streams as an optional feature.

Once backfill is enabled on your stream, you access the feature by simply passing a parameter through each time you connect.

To use backfill, you need to add a “client=number” parameter to the end of the stream URL, where number is the connection number:

  • For a single connection, this must be client=1
  • For 2 connections, this must be client=1 and client=2, etc.
  • You’ll connect initially with this “client” parameter. When disconnected, you then reconnect with the same parameter. This allows the Gnip system to know which stream to backfill in cases of multiple connections.

An example URL for a production PowerTrack stream looks like:

 https://stream.gnip.com:443/accounts/XXXXXXXX/publishers/twitter/streams/track/prod.json

In this example, the URL for a single connection with backfill enabled would look like the following (note the “client=1” parameter):

 https://stream.gnip.com:443/accounts/XXXXXXXXXX/publishers/twitter/streams/track/prod.json?client=1

A second connection with backfill would be:

 https://stream.gnip.com:443/accounts/XXXXXXXXX/publishers/twitter/streams/track/prod.json?client=2

When the stream is disconnected and then reconnected within 5 minutes, the stream will begin with data from just before the disconnect. You’ll see a period of increased, but throttled volume until you have caught back up and are receiving data in realtime.

NOTE: An exception to the normal behavior occurs where the disconnect happened because your client was not able to keep up with the realtime stream (a force disconnect for a full buffer). In this case, the reconnect will begin at realtime with no backfill, even if the backfill parameter is used.


Redundant Streams 

A redundant stream or redundant connection simply allows you to establish more than 1 simultaneous connections to the data stream. This provides redundancy by allowing you to connect to the same stream with two separate servers, receiving the same data through both connections. Thus, your app has a hot failover for various situations, e.g. where one stream is disconnected or where your app’s primary server fails.

The number of connections allowed for any given stream is configurable by your account representative. To use a redundant stream, simply connect to the same URL used for your primary connection. The data for your stream will be sent through both connections, with both stream connections represented on the stream dashboard. If you are also using Backfill, the “client” should correspond to the number of connections you maintain – e.g. if you have 2 connections, one of them should use ‘client=1’ while the other uses ‘client=2’ so that the Gnip system can distinguish between them appropriately.

Note that for billing purposes, we deduplicate the activity counts you receive through multiple connections such that you are only billed for each unique activity once.