Data Collector API Reference


Methods 

Method Description
GET /activities Connect to the polling endpoint for data retrieval.
GET /stream Connect to the streaming endpoint for data retrieval.
POST /rules Add rules to a Data Collector feed.
DELETE /rules Delete rules from a Data Collector feed.
GET /rules Retrieve the rules currently in place on a Data Collector feed.

Authentication 

All requests to the Data Collector APIs must use HTTP Basic Authentication, constructed from the email address and password you use to log in to your account at the URL for your Data Collector. Credentials must be passed in the Authorization header of each request.
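
As a minimal sketch of the scheme described above, the following Python builds the Authorization header by hand. The email address and password are placeholders, and the helper name basic_auth_header is an assumption made for illustration; most HTTP clients can build this header for you.

```python
import base64


def basic_auth_header(email: str, password: str) -> dict:
    """Return an Authorization header for HTTP Basic Authentication."""
    # Basic auth joins the credentials as "email:password" and base64-encodes them.
    token = base64.b64encode(f"{email}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials, for illustration only.
headers = basic_auth_header("example@customer.com", "not-a-real-password")
```

Because base64 is an encoding rather than encryption, the header is only protected in transit when the request is sent over HTTPS.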


GET /activities 

Returns social data activities collected by the specified Data Collector feed.

To use this endpoint, your app will make repeated requests for new batches of data, based on the specifications below.

Request Specifications

Request Method HTTP GET
URL Found on the API Help page of your dashboard.
Character Encoding UTF-8


Optional Parameters

Several query parameters are available for you to tune your polling.

Parameter since_date
Description Only return activities that were collected after the date provided, in UTC time, using the format YYYYmmddHHMMSS (year, month, day, hour, minute, second). Most users will not need to construct this parameter manually. Instead, use the refreshURL described below.
Default none
Limits The Data Collector retains data it collects for an additional 3 days after collection, so since_date should correspond to this timeframe. Using this and the to_date parameter allows you to go back and retrieve data that was collected up to 3 days ago.

Parameter to_date
Description Similar to since_date, but selects activities that were collected before the given date.
Default The time of the request.
Limits The Data Collector retains data it collects for an additional 3 days after collection, so to_date should correspond to this timeframe. Using this and the since_date parameter allows you to go back and retrieve data that was collected up to 3 days ago.

Parameter max
Description The maximum number of activities that will be returned. It should be an integer up to 10000, with no punctuation (commas or periods). Note that if a poll request returns a number of activities equal to the max parameter for your request, additional data may have been collected that this request did not retrieve. Your app should accommodate this by breaking the request into smaller time segments until the number of activities returned for each is less than the max parameter.
Default 100 if no since_date or to_date is given; 10000 if a since_date is given.
Limits 1 to 10000
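
The parameters above can be assembled into a polling URL as in the following sketch. The function names and the example base URL are assumptions for illustration; the real URL is the one shown on the API Help page of your dashboard.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def format_collector_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as YYYYmmddHHMMSS in UTC."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime so UTC conversion is unambiguous")
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def build_activities_url(base_url, since=None, to=None, max_results=None):
    """Append the optional since_date, to_date and max parameters to base_url."""
    if max_results is not None and not 1 <= max_results <= 10000:
        raise ValueError("max must be between 1 and 10000")
    params = {}
    if since is not None:
        params["since_date"] = format_collector_date(since)
    if to is not None:
        params["to_date"] = format_collector_date(to)
    if max_results is not None:
        params["max"] = str(max_results)
    return f"{base_url}?{urlencode(params)}" if params else base_url


# Hypothetical base URL, matching the shape of the examples in this document.
url = build_activities_url(
    "https://example.gnip.com/data_collectors/1/activities.xml",
    since=datetime(2012, 6, 1, 5, 30, 0, tzinfo=timezone.utc),
    to=datetime(2012, 6, 1, 5, 31, 0, tzinfo=timezone.utc),
    max_results=10000,
)
```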

Response

When polling this URL, the latest activities are returned, wrapped in a <results> XML element. An example follows (note that some extra information has been removed from the statuses in the example for brevity; the surrounding content is accurate).

<results data_collector_id="1" publisher="Facebook" endpoint="Keyword - Search" refreshURL="https://example.gnip.com/data_collectors/1/activities.xml?max=10000&since_date=20130816005000">
    <entry>
        ...first activity...
    </entry>
    <entry>
        ...second activity...
    </entry>
</results>

Using the refreshURL

In the example response shown above, the <results> element has an attribute named refreshURL. Your client can use this URL directly in the next polling request to retrieve only activities that have been collected since your last poll, rather than having to construct the URL with the appropriate parameters each time. This URL will specify a since_date corresponding to the time of your last poll request, and a max parameter equal to the one used for your previous request.
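
A short sketch of reading the refreshURL from a poll response with Python's standard XML parser follows. The sample document is a trimmed copy of the example above, with the ampersand in the attribute written as &amp; so that the sample is well-formed XML; the function name is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed sample response; real entries carry full activity payloads.
SAMPLE_RESPONSE = (
    '<results data_collector_id="1" publisher="Facebook" '
    'refreshURL="https://example.gnip.com/data_collectors/1/activities.xml'
    '?max=10000&amp;since_date=20130816005000">'
    "<entry>first activity</entry><entry>second activity</entry>"
    "</results>"
)


def parse_poll_response(xml_text):
    """Return (refresh_url, activities) from a <results> document."""
    root = ET.fromstring(xml_text)
    # Every direct child of <results> is treated as one activity.
    return root.get("refreshURL"), list(root)


next_url, activities = parse_poll_response(SAMPLE_RESPONSE)
```

The client would then issue its next GET request against next_url instead of rebuilding the query string.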

Polling Frequency

Because Gnip returns a maximum of 10,000 activities per request, you need to poll your appliance often enough that fewer than 10,000 activities accumulate between polls. For example, if you poll once an hour and around 15,000 activities are collected per hour, each poll will leave roughly 5,000 of those activities unretrieved. Many users poll once a second or once a minute.
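
The arithmetic behind the example above can be sketched as follows, assuming a steady collection rate (real feeds are bursty, so leave a generous margin). The function name is an assumption for illustration.

```python
def max_poll_interval_seconds(activities_per_hour: float, max_per_poll: int = 10000) -> float:
    """Longest gap between polls before more than max_per_poll activities pile up."""
    if activities_per_hour <= 0:
        raise ValueError("activities_per_hour must be positive")
    return 3600.0 * max_per_poll / activities_per_hour


# At 15,000 activities per hour the 10,000 cap is reached after 40 minutes.
interval = max_poll_interval_seconds(15000)
```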

Data Cache 

The Data Collector has a cache that holds data collected in the current day and the 3 previous days. This cache allows you to backfill data collected during this rolling window, using GET requests for time periods where your client initially failed to consume the data.

If you miss data being collected by your Data Collector (e.g. if you are disconnected from a streaming HTTP connection, your client fails to successfully poll for data for a given time period, an outbound POST fails), you can use the following process to backfill the missed time window:

  • If you’re using an HTTP streaming connection, ensure your app reconnects and continues receiving the realtime data as it’s being collected.
  • Once your normal data retrieval method is running normally again, execute GET requests for the time window you missed.  The time window is defined by encoding since_date and to_date parameters into the URL for your request, with the dates specified in UTC.
  • When executing these requests, set the max parameter to 10000 activities.  If any backfill request returns the full 10,000 activities allowed, you should split the time window into smaller segments and retry the requests, ensuring that you get all the data available for the given time window.

Example Request URL:

https://XXXXX.gnip.com/data_collectors/##/activities.xml?since_date=20120601053000&to_date=20120601053100&max=10000

A GET request to this URL would request data collected by data collector ## between 05:30:00 and 05:31:00 UTC on June 1, 2012, and would pull down up to 10,000 activities. If your GET request returned exactly 10,000 activities, you should split this URL into two time segments and re-run the requests to make sure you get all the data available. For example:

https://XXXXX.gnip.com/data_collectors/##/activities.xml?since_date=20120601053000&to_date=20120601053030&max=10000
https://XXXXX.gnip.com/data_collectors/##/activities.xml?since_date=20120601053030&to_date=20120601053100&max=10000
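
The splitting procedure described above can be sketched as a small recursive function. Here fetch is a hypothetical callable standing in for the GET request, and the in-memory fake_fetch exists only to make the sketch runnable; it treats since_date as inclusive and to_date as exclusive, which is an assumption, since the boundary behavior of the real API is not specified here.

```python
from datetime import datetime, timedelta, timezone


def backfill(fetch, since, to, max_results=10000):
    """Fetch a window, halving it whenever a request comes back full.

    fetch(since, to, max_results) is a hypothetical callable that performs
    the GET request and returns a list of activities.
    """
    activities = fetch(since, to, max_results)
    window = to - since
    if len(activities) < max_results or window < timedelta(seconds=2):
        # Either everything fit, or the window cannot be split further
        # because the date format only has one-second resolution.
        return activities
    midpoint = (since + window / 2).replace(microsecond=0)
    return (backfill(fetch, since, midpoint, max_results)
            + backfill(fetch, midpoint, to, max_results))


# Stand-in for the API: 12 activities, one every 5 seconds, and a small max of 5.
start = datetime(2012, 6, 1, 5, 30, 0, tzinfo=timezone.utc)
events = [start + timedelta(seconds=5 * i) for i in range(12)]


def fake_fetch(since, to, max_results):
    return [e for e in events if since <= e < to][:max_results]


recovered = backfill(fake_fetch, start, start + timedelta(seconds=60), max_results=5)
```

If the real endpoint includes activities on both boundaries, the two halves may overlap by one second, so a production client would likely de-duplicate by activity ID.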

GET /stream 

Opens a persistent connection to the Data Collector feed, with social data sent through the open connection as it is collected from the source API.

This is an alternative to the GET Polling method described above. In this model you do a normal HTTP request, but rather than closing the response stream after a chunk of data, Gnip will keep the response open and continue to send activities on that response.

Unlike in a standard HTTP request, your client should not wait for the request to end before processing data (because the request will "never" end).  Instead, read the data from the response as it comes in, process the data (or store it somewhere for asynchronous processing), then continue reading from the response.  New data will come over the same response stream as the Gnip application gets it.

Request Specifications

Request Method HTTP GET
Connection Type Keep-Alive. This should be specified in the header of the request.
URL Found on the API Help page of your dashboard.
Character Encoding UTF-8

Reconnecting

In some cases (e.g. because of network failures on one end), the response may be closed.  In this case, simply reconnect to the same endpoint and start reading from the new stream.  This typically looks like a simple "while" loop: when the connection closes, the loop returns to the start and re-initiates the connection.

However, note that data collected by the Data Collector while you are disconnected from the API is not delivered through the streaming connection upon reconnecting. To get any data collected during a period of disconnection, see the Data Cache section above on backfilling data from the Data Collector cache.
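
The reconnect loop described above might be sketched as follows. Both open_stream and handle are hypothetical callables: open_stream() is assumed to return an iterable of data chunks that ends, or raises OSError, when the connection closes. The sketch also records each outage so it can later be backfilled from the cache.

```python
import time
from datetime import datetime, timezone


def consume_with_reconnect(open_stream, handle, max_attempts=None, delay_seconds=1.0):
    """Read the stream in a loop, reconnecting whenever the response closes.

    Returns the list of (disconnected_at, reconnected_at) gaps in UTC.
    With max_attempts=None the loop runs until the process is stopped.
    """
    gaps = []
    disconnected_at = None
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            stream = open_stream()
            if disconnected_at is not None:
                # Remember the outage window for a later backfill request.
                gaps.append((disconnected_at, datetime.now(timezone.utc)))
                disconnected_at = None
            for chunk in stream:
                handle(chunk)
        except OSError:
            pass  # Treat network errors the same as a closed response.
        if disconnected_at is None:
            disconnected_at = datetime.now(timezone.utc)
        time.sleep(delay_seconds)  # Brief pause to avoid a tight reconnect loop.
    return gaps
```

Each recorded gap maps directly onto a since_date and to_date pair for a backfill request.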

Consuming XML Data

Because data continues to arrive over the same response stream in a streaming connection, you will need to know when a complete chunk of data is ready to process.  When using the HTTP stream, Gnip sends the data over the stream with each activity as its own "top-level" node.  In other words, activities will not be wrapped in an enclosing <results> element as they are during a normal poll.

Generally the best approach is to read the data and grab the opening element (e.g. the <entry> element when the activities are coming across in Atom format), then continue reading until you see the closing element for that tag (</entry>).  At this point you've got a fully valid Atom XML entry that can be processed by any standard XML parser. Chunking activities this way is the most straightforward means of processing the stream.
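
A minimal sketch of this chunking approach follows. It assumes the stream has already been decoded to text and that <entry> elements are never nested; the function name and the tiny sample entries are illustrative only.

```python
import xml.etree.ElementTree as ET


def iter_entries(chunks, open_tag="<entry", close_tag="</entry>"):
    """Yield complete <entry>...</entry> strings from an iterable of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find(open_tag)
            if start == -1:
                break  # No opening tag yet; wait for more data.
            end = buffer.find(close_tag, start)
            if end == -1:
                break  # Entry is still incomplete; wait for more data.
            end += len(close_tag)
            yield buffer[start:end]
            buffer = buffer[end:]  # Keep only the unprocessed remainder.


# Chunk boundaries deliberately fall in the middle of tags.
chunks = ["<entry><id>1</id></en", "try>\n<entry><id>2</id></entry><ent", "ry><id>3</id>"]
entries = list(iter_entries(chunks))
first_id = ET.fromstring(entries[0]).find("id").text
```

The third entry is not yielded because its closing tag has not arrived yet; it stays in the buffer until more data is read.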


Outbound POST Data Delivery 

The third option for data delivery is to have batches of data sent to a server you have set up to listen for it, via HTTP POST requests. To set up a Data Collector to deliver new activities to your app or system, enter the URL that will receive the POST requests in the Data Collector Settings tab. Your app that is receiving/listening for the in-bound HTTP POST requests should store (and potentially synchronously process) incoming data that it receives. You can configure Gnip to send you data in either XML or JSON format.  Gnip will send the activity in the same structure as it received from the originating API/service.

If you have multiple data collectors sending data to the same application, you may want to encode the data source in the POST URL for each data collector. For example, Data Collector 'A' might POST to http://yourapplication.com/gnip/data-collector-a, and Data Collector 'B' might POST to http://yourapplication.com/gnip/data-collector-b. If you are consuming data from a high volume data source, consider using Streaming HTTP instead and save the cost of a web server process for each request.

Debugging Outbound POST

  • A great utility to see activities as they are delivered from Gnip is Postbin.  After configuring your data collector, go to http://postbin.org and get a "postbin" URL.  Enter this postbin URL in the data delivery field for a data collector and postbin will log and display the raw content of each HTTP POST that your Gnip box sends to it.
  • Each Data Collector that has Outbound POST set up has an "Outbound" tab enabled where you can see in realtime the results of each POST attempt by the data collector.  This tab will show you the response codes that Gnip received from your app and how many new activities were sent with each POST.
  • Make sure your network and application can receive inbound connection requests. Often a software/application developer will build in-bound HTTP POST functionality into their application only to have all in-bound traffic stopped at the router/gateway tier of their network. Operations teams, and default firewall settings, often prohibit in-bound connections from occurring. If you suspect your application is never receiving data, consider the prospect that the data is never making it into your application to begin with.

Implementation Specifications

Multiple Activities Method Gnip batches activities into an outbound POST body, then delivers that batch to your POST enabled URL.
Retry Efforts Currently Gnip does not retry POSTs if they fail for whatever reason.
Timeouts Gnip will time out a POST connection attempt if the connection is not opened within 2 seconds. Furthermore, if Gnip doesn't receive data back from the HTTP POST request within 60 seconds, it will time out the connection and record an error within the Gnip application. This can often occur if your application's processing time is too long per request. See the "Processing Time" row below for more info.
Processing Time If your application is not processing the data it receives fast enough (e.g. if it's taking longer than 60 seconds per POST), then you'll either need to start dropping some percentage of the data on the floor or move to an asynchronous processing model. The general idea here is to digest the in-bound data off of the connection as quickly as possible, and return a response to Gnip. If your code is doing more work than can be accomplished in a relatively small window of time, please move processing out of band.
Redirects Currently Gnip does not handle HTTP redirects (e.g. 301 or 302) when sending HTTP POST requests. Be sure to enter the final, canonical, non-re-directing URL that you want to have consume the POSTs.
Authentication If you wish to have your Gnip box send data to a URL requiring authentication, you must use basic HTTP authentication and enter the URL in the Data Collector settings page in the form http[s]://username:pass@domain.com/path.  Note that if you use the non-SSL version (i.e. 'http://'), your password will NOT be encrypted.
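
The asynchronous model suggested under "Processing Time" can be sketched with Python's standard library as follows. The handler reads the POST body, queues it, and responds immediately, while a background worker does the slower processing. The class and function names, the port, and the choice of an in-memory queue are all assumptions for illustration; a production receiver would likely use a durable queue.

```python
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

pending = queue.Queue()  # Payloads waiting for out-of-band processing.


class CollectorPostHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the body off the connection and acknowledge right away.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # The path can identify the source, e.g. /gnip/data-collector-a.
        pending.put((self.path, body))
        self.send_response(200)
        self.end_headers()


def drain(process):
    """Process queued payloads until a None sentinel is received."""
    while True:
        item = pending.get()
        if item is None:
            break
        path, body = item
        process(path, body)


def serve(port=8080, process=print):
    """Start the worker thread, then block serving inbound POSTs."""
    threading.Thread(target=drain, args=(process,), daemon=True).start()
    HTTPServer(("", port), CollectorPostHandler).serve_forever()
```

Calling serve() starts the listener. Because the handler returns 200 without following redirects or doing any heavy work, it stays well inside the timeouts listed above.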

POST /rules.[xml|json]

Adds the specified rule or rules to a Data Collector feed, to be used in querying the source API. Rules may be specified in either XML or JSON format.

Note: Query and Filtering capabilities depend on the source API, and are not the same as those available in PowerTrack. For details on the query functionality for a specific source, see its documentation in the sources section.

Request Method HTTP POST
Content Type "application/xml" or "application/json". The request should specify this as the "Content-type".
URL Found on the API Help page of your dashboard.
Character Encoding UTF-8
Request Body Format JSON or XML

When adding rules, if a rule with the same value already exists on the data collector, the new (second) one is ignored and not added. Thus, to update a tag, delete the rule, then add it again with the new tag. A rule may have only a single tag, with a maximum length of 255 characters.

Rule Tags

When they are created, rules can be given tags. Rule tags have no special meaning; they are simply treated as opaque strings carried along with the rule.

Practically, tags are a simple way to create logical groupings for your rules. For example, a set of 500 rules you create for a single customer can be given the same tag. Additionally, in Gnip’s Activity Streams data format, the tag is included in the <gnip:matching_rule> element in each activity returned. Thus, for each result, your system can easily use the tag to correlate the result with an internal ID as it is returned.

Tags cannot be updated on an existing rule. In order to "update" a tag, you need to first remove the rule, then add it again with the updated tag.

Request Body Example

XML

Content-type: "application/xml"
<rules>
    <rule tag="XML-encoded-TagBelongsHere">
        <value>XML-encoded-RuleBelongsHere</value>
    </rule>
    <rule>
        <value>XML-encoded-AnotherRuleHere</value>
    </rule>
    ...
</rules>

JSON

Content-type: "application/json"
{
    "rules":
    [
        {"value":"rule1","tag":"tag1"},
        {"value":"rule2"}
    ]
}

Example Curl Requests

The following example requests demonstrate how to add rules using cURL on the command line, using XML and JSON rules respectively.

curl -v -X POST -uexample@customer.com -H "Content-Type: application/xml" "https://example.gnip.com/data_collectors/1/rules.xml" -d '<rules><rule tag="XML-encoded-TagBelongsHere"><value>XML-encoded-RuleBelongsHere</value></rule><rule><value>XML-encoded-AnotherRuleHere</value></rule></rules>'
curl -v -X POST -uexample@customer.com -H "Content-Type: application/json" "https://example.gnip.com/data_collectors/1/rules.json" -d '{"rules":[{"value":"rule1","tag":"tag1"},{"value":"rule2"}]}'

Responses

  • 20x on Success
  • 40x on Failure. Failures include a corresponding error message in the HTTP response body.

Example Responses (excerpted)

HTTP/1.1 201 Created
{"response":{"message":"created"}}
HTTP/1.1 400 Bad Request
{"response":{"rule_example":"{\"rules\":[{\"tag\":\"a_tag\",\"value\":\"rule_with_tag\"},{\"value\":\"rule_without_tag\"}]}","message":"Failure: Poorly formatted JSON. You provided something like: '{\"rules\":[{valu'. Expected format in rule_example element."}}

Notes

  • The request body size of a rule update is limited to 100KB.
  • The rule value and rule tag are limited to 255 characters each.
  • The tag attribute is optional, as shown above: one rule in each example has a tag and the other does not.
  • Best practice is to add and remove rules incrementally instead of deleting and recreating the entire list each time a change is necessary.
  • New rules are placed at the top of the polling queue and are processed in a LIFO manner.
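
The size limits in the notes above can be checked on the client before sending, as in this sketch. The function name is an assumption, and 100KB is taken to mean 100 * 1024 bytes, which the notes do not state explicitly.

```python
import json

MAX_FIELD_CHARS = 255
MAX_BODY_BYTES = 100 * 1024  # Assumption: 100KB means 100 * 1024 bytes.


def build_rules_body(rules):
    """Build a JSON request body from (value, tag_or_None) pairs."""
    payload = []
    for value, tag in rules:
        if len(value) > MAX_FIELD_CHARS:
            raise ValueError(f"rule value exceeds {MAX_FIELD_CHARS} characters")
        entry = {"value": value}
        if tag is not None:
            if len(tag) > MAX_FIELD_CHARS:
                raise ValueError(f"rule tag exceeds {MAX_FIELD_CHARS} characters")
            entry["tag"] = tag  # The tag is optional and omitted when None.
        payload.append(entry)
    body = json.dumps({"rules": payload}).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("request body exceeds 100KB; send the rules in smaller batches")
    return body


body = build_rules_body([("rule1", "tag1"), ("rule2", None)])
```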

DELETE /rules 

Deletes the specified rule or rules from a Data Collector feed. Data already collected from this rule will not be removed from the cache.

Request Method HTTP DELETE (optionally you can use the "_method=delete" query parameter with the POST method; see "Notes" below)
Content Type "application/xml" or "application/json". The request should specify this as the "Content-type".
URL Found on the API Help page of your dashboard.
Character Encoding UTF-8
Request Body Format JSON or XML

Request Body Content

XML

Content-type: "application/xml"
<rules>
    <rule tag="XML-encoded-TagBelongsHere">
        <value>XML-encoded-RuleBelongsHere</value>
    </rule>
    <rule>
        <value>XML-encoded-AnotherRuleHere</value>
    </rule>
    ...
</rules>

JSON

Content-type: "application/json"
{
    "rules":
    [
        {"value":"rule1","tag":"tag1"},
        {"value":"rule2"}
    ]
}

Example Curl Request

The following example request demonstrates how to delete rules using cURL on the command line.

curl -v -X DELETE -uexample@customer.com -H "Content-Type: application/json" "https://example.gnip.com/data_collectors/1/rules.json" -d '{"rules":[{"value":"rule1","tag":"tag1"},{"value":"rule2"}]}'

Responses

  • 20x on Success. However, note that this response simply indicates that the request body was valid and accepted by the API. It does not reflect whether a match was found for the rule(s) specified in the request (i.e. whether there was a matching rule to delete).
  • 40x on Failure. Failures include a corresponding error message in the HTTP response body.

Example Response (excerpted)

HTTP/1.1 202 Accepted
{"response":{"message":"accepted"}}

Notes

  • This method is also available by sending a POST request to the same URL with the URL query parameter "_method=delete" (e.g. for GAE and other environments that don't allow the 'DELETE' method on HTTP requests).
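
Both variants of the delete request can be sketched with Python's standard urllib as follows. The function name and the auth_header argument (a ready-made "Basic ..." value) are assumptions for illustration; the request objects are only built here, not sent.

```python
import json
import urllib.request


def delete_rules_request(rules_url, rules, auth_header, use_post_override=False):
    """Build a request that deletes rules, optionally via POST with _method=delete."""
    body = json.dumps({"rules": rules}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Authorization": auth_header}
    if use_post_override:
        # For environments that cannot send the DELETE method.
        separator = "&" if "?" in rules_url else "?"
        return urllib.request.Request(
            rules_url + separator + "_method=delete",
            data=body, headers=headers, method="POST",
        )
    return urllib.request.Request(rules_url, data=body, headers=headers, method="DELETE")


request = delete_rules_request(
    "https://example.gnip.com/data_collectors/1/rules.json",
    [{"value": "rule1", "tag": "tag1"}, {"value": "rule2"}],
    auth_header="Basic placeholder",
    use_post_override=True,
)
```

Passing the result to urllib.request.urlopen would send it; a 20x response then means the body was accepted, not that a matching rule was found.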

GET /rules 

Retrieves the list of all rules currently existing on a Data Collector feed.

Request Method HTTP GET
URL Found on the API Help page of your dashboard.
Request Body Format N/A

Example Request

curl -v -X GET -uexample@customer.com "https://example.gnip.com/data_collectors/1/rules.json"

Responses

  • 20x on Success
  • 40x on Failure. Failures include a corresponding error message in the HTTP response body.

Example Response

HTTP/1.1 200 OK
{"rules":[{"value":"gnip"},{"value":"social"},{"value":"data"}]}
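
One way to follow the incremental-update practice recommended earlier is to compare the rules returned by this endpoint with the rules you want, then send only the differences. The sketch below assumes the response uses the JSON shape shown above and that a "tag" key appears on tagged rules; the function name is hypothetical.

```python
import json


def plan_rule_changes(current_response_body, desired_rules):
    """Return (to_delete, to_add) lists of rule dicts."""
    def as_pairs(rules):
        return {(rule["value"], rule.get("tag")) for rule in rules}

    def as_dicts(pairs):
        ordered = sorted(pairs, key=lambda pair: (pair[0], pair[1] or ""))
        return [{"value": v} if t is None else {"value": v, "tag": t} for v, t in ordered]

    current = as_pairs(json.loads(current_response_body)["rules"])
    desired = as_pairs(desired_rules)
    # A rule whose tag changed lands in both lists, because tags cannot be
    # updated in place: delete the old rule first, then add the new one.
    return as_dicts(current - desired), as_dicts(desired - current)


current_body = '{"rules":[{"value":"gnip"},{"value":"social"},{"value":"data"}]}'
wanted = [{"value": "gnip"}, {"value": "data", "tag": "t1"}, {"value": "api"}]
to_delete, to_add = plan_rule_changes(current_body, wanted)
```

The deletions should be sent before the additions, since adding a rule whose value already exists is ignored.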