- Setting up a Data Collector on Gnip
- Managing Rules on your Feeds/Streams
- Consuming Data from Gnip
- Data Source Details
Setting up a Data Collector on Gnip
To get started, your account rep or sales rep will need to create a Data Collector instance for your account and provide you with login directions.
Once logged into your data collector, you can create your first feed using the following steps:
- Click the button labeled "Start Adding Feeds".
- Select a data publisher and click on the type of data you would like.
- Configure the collector with the requested information.
- Add the rules you want Gnip to seed data collection with.
- Click "Create".
Click on the API Help tab of your collector to get your API URLs for deeper application integration.
That’s it. To add additional feeds, repeat the steps above. You may add up to 6 feeds per data collector.
Managing Rules on your Feeds/Streams
Please see the rules management documentation.
Consuming Data from Gnip
Gnip provides you with options for consuming data from the social media APIs you want. Most importantly, a Gnip Enterprise Data Collector gives you the option to receive data in a push model instead of a pull model.
- POST (webhooks): You can register a URL for your Gnip box to POST to when a new activity is found. Read more details about webhook delivery in the Outbound POST documentation.
- Streaming HTTP: You can establish a persistent streaming HTTP connection with your Gnip box to consume data in real-time as your Gnip Data Collector finds it. Read more details about our Streaming HTTP support.
- GET: Poll your Gnip Data Collector. Your Gnip Data Collector exposes an XML endpoint for each data collector. To get the URL for this endpoint, click on the DATA tab for the collector. Read more about polling Gnip including optional query parameters to tune your polls and features of the results Gnip will return.
HTTP GET (Polling Gnip)
The simplest method for getting data from Gnip is to poll your data collector for the latest data. The most recent activities for each data collector will be returned.
Consuming Data via Polling
To retrieve your data via polling simply send a regular HTTP GET request to your data collector's activity URL. The URL for your collector can be found on the "Data" tab of the Gnip application for each collector.
When polling this URL, the latest activities are returned wrapped in a <results> XML element.
Optional Query Parameters
Several query parameters are available for you to tune your polling.
|Parameter|Description|Default|
|---|---|---|
|since_date|Only return activities that were retrieved since the given date. The date must always be in UTC, in the format YYYYmmddHHMMSS (year, month, day, hour, minute, second). Most users will not need to construct this parameter manually; use the refreshURL described below. Note that this is the time the data was collected by your data collector, which may not match the date from the originating source within the activity.|none|
|to_date|Similar to since_date, but selects activities that were retrieved before the given date.|none|
|max|The maximum number of activities to return, as a plain integer up to 10000 with no punctuation (commas or periods).|100 if no since_date or to_date is given; 10000 if a since_date is given|
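As a sketch of how these parameters fit together, the helper below builds a polling URL from the optional parameters. The host, collector id, and the helper itself are illustrative assumptions, not part of the Gnip API:

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

def poll_url(base_url, since=None, to=None, max_activities=None):
    """Build a polling URL from the optional since_date/to_date/max
    parameters. Dates must be UTC, formatted YYYYmmddHHMMSS."""
    params = {}
    if since is not None:
        params["since_date"] = since.strftime("%Y%m%d%H%M%S")
    if to is not None:
        params["to_date"] = to.strftime("%Y%m%d%H%M%S")
    if max_activities is not None:
        params["max"] = str(max_activities)  # plain integer, no punctuation
    return base_url + ("?" + urlencode(params) if params else "")

# The host and collector id below are placeholders, not a real endpoint.
base = "https://demo.gnip.com/data_collectors/1/activities.xml"
since = datetime(2011, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
print(poll_url(base, since=since, max_activities=10000))
```

Remember that the actual GET request must also carry your credentials via Basic Authorization, as described below.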
Using the refreshURL
In the example result above, the <results> element has an attribute named refreshURL. You can use this URL directly to get only activities since your last poll rather than having to construct the URL with the appropriate parameters each time.
In order to access your activities supply your login and password via Basic Authorization as part of your request. Note that if you use the non-SSL (e.g. 'http://') version of the URL, your password will NOT be encrypted.
Because Gnip returns a maximum of 10,000 activities at a time, you will need to poll your appliance fast enough that you don't miss activities. For example, if you poll once an hour, and you get around 15,000 activities per hour collected, you will not be able to get 5,000 of those activities. Many users will poll once a second or once a minute.
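A minimal sketch of pulling the refreshURL off a poll response with the standard library XML parser; the sample response body and the URL value in it are placeholders:

```python
import xml.etree.ElementTree as ET

def next_poll_url(results_xml):
    """Extract the refreshURL attribute from the <results> root element so
    the next poll starts exactly where this one left off."""
    return ET.fromstring(results_xml).get("refreshURL")

# Illustrative response body; the attribute value is a placeholder.
sample = ('<results refreshURL="https://demo.gnip.com/data_collectors/1/'
          'activities.xml?since_date=20110601120000"></results>')
print(next_poll_url(sample))
```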
Number of Activities in Each Result
By default, if none of the optional query parameters outlined above are included in the URL, Gnip will return a max of 100 activities in each result. If the max query parameter is specified Gnip will return a max based on that parameter, up to a hard maximum of 10,000 activities. If the since_date parameter (and no max parameter) is specified Gnip will return a maximum of 10,000 activities in each result.
HTTP POST (Webhooks)
How to set up Outbound POST
To set up a Data Collector to deliver new activities to your app or system, enter the URL that will receive the POST requests in the Data Collector Settings tab.
Your app that is receiving/listening for the in-bound HTTP POST requests should store (and potentially synchronously process) incoming data that it receives. You can configure Gnip to send you data in either XML or JSON format. Gnip will send the activity in the same structure as it received from the originating API/service.
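As an illustration of the receive-and-store pattern, here is a minimal sketch of an inbound POST listener using Python's standard library. The port, the `store` stub, and the JSON assumption are illustrative, not Gnip-provided code:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_batch(body, content_type):
    """Decode one POSTed batch. If JSON delivery was configured, parse it;
    otherwise hand the raw XML document through to your XML parser."""
    if "json" in content_type:
        return json.loads(body)
    return body

def store(batch):
    """Placeholder: persist the batch (file, queue, datastore) for
    out-of-band processing rather than processing it inline."""
    pass

class GnipHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        store(parse_batch(body, self.headers.get("Content-Type", "")))
        # Respond immediately so Gnip's timeout window is never a concern.
        self.send_response(200)
        self.end_headers()

# To run the listener: HTTPServer(("", 8080), GnipHandler).serve_forever()
```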
How to Debug Outbound POST
- A great utility to see activities as they are delivered from Gnip is Postbin. After configuring your data collector, go to http://postbin.org and get a "postbin" URL. Enter this postbin URL in the data delivery field for a data collector and postbin will log and display the raw content of each HTTP POST that your Gnip box sends to it.
- Each Data Collector that has Outbound POST setup has an "Outbound" tab enabled where you can see in realtime the results of each POST attempt by the data collector. This tab will show you the response codes that Gnip received from your app and how many new activities were sent with each POST.
- Make sure your network and application can receive inbound connection requests. Often a developer will build inbound HTTP POST functionality into their application only to have all inbound traffic stopped at the router/gateway tier of their network. Operations teams, and default firewall settings, often prohibit inbound connections from occurring. If you suspect your application is never receiving data, consider the possibility that the data is never making it into your application to begin with.
Gnip batches activities into an outbound POST body, then delivers that batch to your POST enabled URL.
Currently Gnip does not retry POSTs if they fail for whatever reason.
Gnip will timeout a POST connection attempt if it is not opened within 2 seconds. Furthermore, if Gnip doesn't receive data back from the HTTP POST request within 60 seconds, it will timeout the connection and record an error within the Gnip application. This can often occur if your application's processing time is too long per request. See the "Processing Time" section for more info.
If your application is not processing the data it receives fast enough (e.g. if it's taking longer than 60 seconds per POST), then you'll either need to start dropping some percentage of the data on the floor or move to an asynchronous processing model. The general idea here is to digest the in-bound data off of the connection as quickly as possible, and return a response to Gnip. If your code is doing more work than can be accomplished in a relatively small window of time, please move processing out of band.
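One way to sketch that out-of-band model in Python: the request thread only enqueues and returns, and a worker thread does the slow processing (the `upper()` call is a stand-in for real work; the queue and sentinel shutdown are illustrative choices):

```python
import queue
import threading

inbound = queue.Queue()
processed = []

def receive(batch):
    """Called on the request thread: enqueue and return immediately so the
    HTTP response gets back to Gnip well inside its timeout window."""
    inbound.put(batch)

def worker():
    while True:
        batch = inbound.get()
        if batch is None:  # sentinel: shut the worker down
            break
        processed.append(batch.upper())  # stand-in for real, slow processing

t = threading.Thread(target=worker)
t.start()
for item in ("activity-1", "activity-2"):
    receive(item)
receive(None)
t.join()
print(processed)
```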
Currently Gnip does not handle HTTP redirects (e.g. 301 or 302) when sending HTTP POST requests. Be sure to enter the final, canonical, non-re-directing URL that you want to have consume the POSTs.
If you wish to have your Gnip box send data to a URL requiring authentication, you must use basic HTTP authentication and enter the URL in the Data Collector settings page in the form http[s]://username:password@example.org/path. Note that if you use the non-SSL (e.g. 'http://') version, your password will NOT be encrypted.
- If you have multiple data collectors sending data to the same application, you may want to encode the data source in the POST URL for each data collector. For example, Data Collector 'A' might POST to http://yourapplication.com/gnip/data-collector-a, and Data Collector 'B' might POST to http://yourapplication.com/gnip/data-collector-b.
- If you are consuming data from a high volume data source, consider using Streaming HTTP instead and save the cost of a web server process for each request.
Streaming HTTP
Gnip provides access to the data with a streaming HTTP implementation that allows you to consume data in near real-time over a single persistent HTTP connection/request. In this model you do a normal HTTP request, but rather than closing the response stream after a chunk of data, Gnip will keep the response open and continue to send activities on that response.
Note that this method is sometimes called streaming Comet (see the Wikipedia article on Comet), but in the Gnip case the "client" is not a browser. Instead it is typically another server connecting to the Gnip application for server-to-server communication.
Consuming Data via Streaming HTTP
To retrieve your data via streaming HTTP simply send an HTTP GET request to the stream.xml endpoint for your data collector. The stream.xml endpoint will be https://yourdatacollectorhost.com/data_collectors/<data_collector_id>/stream.xml. After you send the request, the Gnip application will start sending activities over the request and continue until the connection or request is closed.
Unlike in a standard HTTP request, your client should not wait for the request to end before processing data (because the request will "never" end). Instead, read the data from the response as it comes in, process the data (or store it somewhere for asynchronous processing), then continue reading from the response. New data will come over the same response stream as the Gnip application gets it.
Accessing the stream begins with a standard HTTP GET request, so authorization happens in the same manner as a regular poll. Specifically, to access the stream you must supply your Gnip application credentials via Basic Authorization. Note that if you use the non-SSL (e.g. 'http://') version, your password will NOT be encrypted.
In some cases, either because of network failures on one end or because of a lack of data coming through, the response may be closed. In this case simply reconnect to the same endpoint and start reading from the new stream. This typically looks like a simple "while" loop that when the connection closes it goes back to the start to re-initiate the process.
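The reconnect loop can be sketched as below. The `connect` callable, the failure counting, and the `ConnectionError` convention are illustrative assumptions; a production client would also add a backoff delay between attempts:

```python
def consume_stream(connect, handle, max_failures=5):
    """Read activities from a streaming connection, reconnecting whenever it
    drops. `connect` returns an iterator of activities and raises
    ConnectionError on network failure; we give up only after max_failures
    consecutive failed attempts."""
    failures = 0
    while failures < max_failures:
        try:
            for activity in connect():
                handle(activity)
                failures = 0  # data is flowing again: reset the counter
        except ConnectionError:
            failures += 1
        else:
            failures += 1  # server closed the stream cleanly: reconnect too

# Fake connection that fails twice, then delivers two activities once.
attempts = {"n": 0}
received = []

def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] == 3:
        return iter(["activity-a", "activity-b"])
    raise ConnectionError("stream dropped")

consume_stream(flaky_connect, received.append, max_failures=3)
print(received)
```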
Consuming XML Data
Because data will continue to arrive over the same response stream you will need to know when to process a particular chunk of data. When using the HTTP stream Gnip sends the data over the stream with each activity as its own "top-level" node. In other words, activities will not be wrapped in an enclosing <results> element as they are during a normal poll.
Generally the best approach is to read the data and grab the opening element (e.g. the <entry> element when the activities are coming across in Atom format), then continue reading until you see the closing element for that tag (</entry>). At this point you have a fully valid Atom XML entry that can be processed by any standard XML parser.
Chunking activities this way is the most straightforward means of processing the stream.
For example, consider a stream from a Twitter Actor - Notices data collector containing two <status> documents. Using the approach outlined above, your stream client would see the first <status> element, continue reading until it sees the first </status> element, then process that chunk as a single XML document. It would then continue reading at the second <status> element and process the second document when it sees the second </status> element.
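A hedged sketch of such a chunker using a naive string scan; a production client should prefer an incremental parser such as `xml.etree.ElementTree.XMLPullParser`, since this version assumes the tag carries no attributes and does not nest inside itself:

```python
import io

def iter_documents(stream, tag):
    """Yield each complete <tag>...</tag> document from a character stream.
    Naive by design: assumes the tag has no attributes and does not nest
    inside itself, which holds for the <status> stream described here."""
    open_tag, close_tag = "<%s>" % tag, "</%s>" % tag
    buf = ""
    while True:
        chunk = stream.read(1024)
        if not chunk:
            break
        buf += chunk
        while True:
            start, end = buf.find(open_tag), buf.find(close_tag)
            if start == -1 or end == -1:
                break
            end += len(close_tag)
            yield buf[start:end]
            buf = buf[end:]

# Illustrative two-document stream, stripped down for brevity.
sample = "<status><id>1</id></status><status><id>2</id></status>"
docs = list(iter_documents(io.StringIO(sample), "status"))
print(docs)
```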
The Enterprise Data Collector has a cache that holds data collected over the last 2 days from the feeds you have set up. This cache allows you to backfill data collected during this rolling window using GET requests for time periods where your client initially failed to consume the data successfully.
Backfilling Data from the Cache
If you miss data being collected by your Enterprise Data Collector (e.g. if you are disconnected from a streaming HTTP connection, your client fails to successfully poll for data for a given time period, an outbound POST fails), you can use the following process to backfill the missed time window:
- If you’re using an HTTP streaming connection, ensure your app reconnects and continues receiving the realtime data as it’s being collected.
- Once your normal data retrieval method is running normally again, execute GET requests for the time window you missed. The time window is defined by encoding since_date and to_date parameters into the URL for your request, with the dates specified in UTC.
- When executing these requests, set the max parameter to 10000 activities. If any backfill request returns the full 10,000 activities allowed, you should split the time window into smaller segments and retry the requests, ensuring that you get all the data available for the given time window.
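The split-and-retry logic above can be sketched as a recursive helper. The integer timestamps and the `fetch` callable are illustrative stand-ins for real since_date/to_date requests against your collector:

```python
def backfill(fetch, since, to, limit=10000):
    """Fetch a missed window; if a request comes back full (== limit),
    assume it was truncated and split the window in half, recursively.
    Windows of size 1 are returned as-is to guarantee termination."""
    activities = fetch(since, to)
    if len(activities) < limit or to - since <= 1:
        return activities
    mid = since + (to - since) // 2
    return backfill(fetch, since, mid, limit) + backfill(fetch, mid, to, limit)

# Fake data source: six activities at timestamps 0-5, capped at 3 per request.
data = list(range(6))

def fake_fetch(since, to):
    return [t for t in data if since <= t < to][:3]

result = backfill(fake_fetch, 0, 6, limit=3)
print(result)
```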
Data Source Details
Facebook limits the amount of data that can be extracted from the Graph API based on a combination of your IP address and Access Token. For this reason, two separate Facebook feeds deployed on the same Gnip data collector will be viewed by the Graph API as the same entity and will be forced to compete for the same limited resources. We recommend the following practices to optimize your coverage of Facebook data:
- Ideally, each Facebook feed should be deployed on a separate Gnip data collector to ensure that the feeds do not compete for resources from Facebook.
- If you are using the Facebook Keyword Search feed, it is likely your highest-volume Facebook feed (depending on your rules). If possible, Facebook Keyword Search feed should not be deployed on the same Gnip data collector as another Facebook feed.
- If you are tracking very high-volume keywords with the Facebook Keyword Search feed, utilizing Keyword Search feeds on separate Gnip data collectors can help load-balance your rules, and provide better coverage of the available data.
- If you decide to use multiple Facebook feeds on a single Gnip data collector, you should consult with the Gnip support team to optimize the feeds and avoid being rate-limited by Facebook.
Obtaining a Facebook Access Token
To use the Facebook : Fan Page endpoint customers will need an "Access Token." The process to get an Access Token is currently a little cumbersome. Customers will first need to go to Facebook to create an application and register as a developer at http://developers.facebook.com/setup/. Facebook will give back an "App ID" and an "App Secret". Take these two values and plug them into the following command on any unix box:
curl -F type=client_cred \
  -F client_id=app_id \
  -F client_secret=app_secret \
  "https://graph.facebook.com/oauth/access_token?scope=offline_access"
This will return an "Access Token" from Facebook that can be used in the Gnip UI to set up a Facebook : Keyword - Stream endpoint; make sure to copy all of it. Once you get an access token back, verify that you have copied it correctly by running the command
curl -v "https://graph.facebook.com/search?q=cat&type=post&limit=2&access_token=access_token"
and verifying that Facebook returns data rather than an "Error processing access token" message.
Activity Streams Mapping
|Facebook Domain Object|Activity Stream verb|Activity Stream object-type|
|---|---|---|
*Note that the XML namespace URI for the Activity Stream verb and object-type is: http://activitystrea.ms/spec/1.0/.
Vimeo Oauth keys can be generated at: http://vimeo.com/api/applications/new
Flickr API keys can be generated at: http://www.flickr.com/services/apps/create/
To use the "Actor - Favorites" endpoint you will need to enter actor IDs, not names. Actor IDs look like "21943959@N07" and can be found by looking for the string "nsid" in the HTML source of Flickr pages.