Data Collector Overview
Gnip’s Data Collector offers managed public API access to over a dozen social media sources. Unlike PowerTrack, which offers realtime filtering on full social data firehoses, the Data Collector sends requests for social data to public API endpoints on your behalf. It simplifies collecting social data from public APIs by providing an appliance that collects the data and sends it directly to you in a normalized format. By integrating a Gnip Data Collector into your product, you can focus on data analytics and on serving your clients, rather than on building and maintaining numerous public API integrations.
The most important thing to keep in mind is that your Data Collector is a completely separate product from your PowerTrack streams. PowerTrack, and the other products that you manage directly from your console.gnip.com dashboard, are complete-access sources: we have special agreements with these social media sources (such as Twitter) to serve their realtime data to you in its entirety (with some limitations). The Data Collector, on the other hand, offers managed public API access. We do not have a special agreement with these sources; instead, we make requests on your behalf with the parameters that you have specified, and pass through to you whatever data the public endpoint makes available for those parameters.
Data Collector feeds are pre-configured to send requests to specific endpoints on the public API in question. For example, the Facebook Fan Page feed is set up to send requests to the endpoint of Facebook’s Graph API that supports retrieving posts and comments for a specific page. The feed uses the inputs you supply (in this case, page IDs and an API token) to make the requests on your behalf, and passes through the data that is returned.
These requests are based on the unique requirements and nuances of the specific API, and are designed to comply with the API’s specifications. Further, they are subject to the limitations of the public APIs – rate limits and volume limits are observed. Rather than having to build an app yourself that accounts for these nuances, as well as any future changes in them, the Gnip appliance provides a normalization layer so that you can integrate with the Data Collector API, and leave the upstream integration maintenance to us.
Rules and Queries
The Data Collector simply passes through the queries you specify, and cannot support types of queries that the source API itself does not support. We strongly recommend checking the documentation pages for our feeds, as well as the source API’s documentation, when investigating support for specific types of queries on an API.
The rules used in your feed’s queries may be added programmatically through the API, as described in the API reference. This allows your app to update the list of rules you want to track through requests to the Data Collector API, without the need to log into the Data Collector dashboard.
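As a rough illustration of adding rules programmatically, the Python sketch below builds (but does not send) a POST request carrying a list of rules. The endpoint path, the JSON payload shape, and the function name `build_add_rules_request` are assumptions made for this example only; the API reference remains the authority on the actual URL, authentication, and body format.

```python
import json
import urllib.request


def build_add_rules_request(base_url, collector_id, rule_values):
    """Build (but do not send) a POST request that adds rules to a feed.

    The URL layout and payload shape are hypothetical placeholders;
    replace them with the values given in the API reference.
    """
    # Hypothetical endpoint layout for a feed's rule list.
    url = f"{base_url}/data_collectors/{collector_id}/rules.json"
    # Hypothetical payload: one object per rule, keyed by "value".
    payload = {"rules": [{"value": value} for value in rule_values]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example with a placeholder host; credentials would also be required
# before the request could actually be sent.
request = build_add_rules_request("https://example.invalid", 1, ["page-id-1", "page-id-2"])
```

Building the request separately from sending it keeps the rule list easy to inspect and test before your app touches the live feed.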
The Data Collector provides further normalization by formatting the data using our Activity Streams schema in many cases, with the option of using the source’s original format instead. The normalized format provides a more consistent data structure and set of fields across sources whose native formats may otherwise differ widely.
A common misconception is that the Data Collector works in the same way as our realtime, complete-access sources. An important difference between managed public API access and our complete-access sources is how the data is delivered to you. Realtime, complete-access sources deliver the most recent data that matches your rules to your application over an established HTTP connection. Data Collectors, on the other hand, collect the data that is made available via the public APIs and store it in a data cache, from which you must retrieve it. There are three options for data retrieval, and they are configurable on a feed-by-feed basis in your dashboard.
- GET (Polling) Requests: Retrieve your results in batches using GET requests
- HTTP Streaming: Have new results pushed to your app through an open connection as they are collected
- Outbound POST (webhooks): Have new results pushed to you in batches as they are collected using POST requests
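To make the first option concrete, here is a minimal Python sketch of a polling loop. It assumes your app supplies a `fetch_batch` callable that performs one GET request against the feed and returns a list of results; that callable, along with `handle_activity`, `interval_seconds`, and `max_polls`, is a hypothetical name chosen for this example, not part of the Data Collector API.

```python
import time


def poll_feed(fetch_batch, handle_activity, interval_seconds=60, max_polls=None):
    """Fetch batches repeatedly and pass each result to a handler.

    fetch_batch is a caller-supplied function (hypothetical here) that
    performs one GET request and returns a list of results.
    Returns the total number of results handled.
    """
    polls = 0
    delivered = 0
    while max_polls is None or polls < max_polls:
        for activity in fetch_batch():
            handle_activity(activity)
            delivered += 1
        polls += 1
        # Wait between polls, but not after the final one.
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)
    return delivered


# Stand-in for real GET requests: three canned batches, one of them empty.
canned_batches = iter([[{"id": 1}, {"id": 2}], [], [{"id": 3}]])
received = []
total = poll_feed(lambda: next(canned_batches), received.append,
                  interval_seconds=0, max_polls=3)
```

Passing the fetch step in as a function keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.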
In any of these scenarios, the Data Collector delivers the combined data for your queries. If your feed is collecting data for many different queries, then this data will be delivered to you together. Notably, in the Activity Streams format, Gnip provides metadata to indicate which query a specific result is associated with, which is useful when consuming the data into your app.
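Because results for all of a feed's queries arrive together, an app will often want to split them back out by query. The Python sketch below groups results by their matching rule. The field names used here (`gnip`, `matching_rules`, `value`) are assumptions about where that metadata sits in the Activity Streams JSON; check a sample payload from your own feed before relying on them.

```python
from collections import defaultdict


def group_by_rule(activities):
    """Group parsed results by the rule(s) each one matched.

    Assumes (hypothetically) that each result carries its rule metadata
    under activity["gnip"]["matching_rules"], as a list of objects with
    a "value" key. Results without that metadata are grouped under None.
    """
    grouped = defaultdict(list)
    for activity in activities:
        rules = activity.get("gnip", {}).get("matching_rules", [])
        if not rules:
            grouped[None].append(activity)
        for rule in rules:
            grouped[rule.get("value")].append(activity)
    return dict(grouped)


# Hand-written sample data for illustration only.
sample = [
    {"id": "a", "gnip": {"matching_rules": [{"value": "page:1"}]}},
    {"id": "b", "gnip": {"matching_rules": [{"value": "page:2"}, {"value": "page:1"}]}},
    {"id": "c"},
]
by_rule = group_by_rule(sample)
```

A result that matched more than one rule appears under each of them, so downstream code that counts results per query should expect some overlap.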
Data Collector Support
As described above, the Data Collector differs from our complete-access products in a number of ways, including the amount of insight we have into API or data output changes, matching, and maintenance downtime. This limited insight affects the type of support we are able to offer for Data Collectors, because we cannot troubleshoot issues that are outside our control.
With our complete-access products, we have visibility into and control over our own products, and open communication channels with our premium sources, through which we are made aware of upcoming changes or instability they may be experiencing. Because we don’t have exclusive relationships with the sources offered on the Data Collector, our visibility into API changes, downtime, or the data that is returned is limited or non-existent. This isn’t to say that the support team won’t look at questions you have regarding your Data Collector. Rather, because there are no formal communication channels or partnerships between Gnip and these sources, the support team relies on information made public via the sources’ status notices, developer communities, and forums, along with our own observations, to infer answers to your questions.
Although we don’t always have insight into, or definitive answers about, why Data Collector sources return the results that they do or why they exhibit unpredictable behavior, we do have insight into the health of the Data Collector itself and can troubleshoot it. Gnip uses an internal monitoring system that alerts us to issues with individual Data Collectors as they come up, and we resolve them accordingly.