What is Gnip?
posted on 13 March 2014 by stephen compston
Gnip is the world’s largest and most trusted provider of social data. With the most sources, the most customers, and the most robust infrastructure to deliver it all, Gnip is the Social Data API. Whether you want a full firehose of data, or want to filter that firehose down to a very specific subset of activities, Gnip has the products that allow you to get the social data you need.
Functionally, Gnip integrates into the backend of your system to deliver the raw social data that powers your app. Through our APIs, your app can define the content it needs, and have it delivered reliably and sustainably.
We also take the social data supplied by data publishers and make it more useful through data enrichments and easier to work with through format normalization.
To integrate Gnip into your products, your app will need to be able to:
- Send HTTP requests with various specifications to interact with our APIs.
- Work with JSON-formatted data, as well as XML in some cases.
Our APIs serve a variety of access models – realtime streams, batches of recent data, individual activities, historical coverage – and numerous data sources. See our documentation on the particular API you are interested in for details on the types of requests your app will need to make, and data source documentation for details on consuming the data format into your app.
Gnip strives to retain backward compatibility throughout the system, including any point at which users’ code would interact with the API. This generally involves the Rules API and the various data consumption APIs detailed in Consuming Data via Gnip Console and Consuming Data via Enterprise Data Collector.
As a general principle, Gnip will not make any of the following types of changes without providing prior notification to its customers:
- Change to the documented process required to connect to and use any existing API endpoint
- Change to the existing URLs required to access any current customer’s API endpoints
- Deprecation of Enterprise Data Collector feeds that serve data and are currently in use by customers, so long as such deprecation is within the control of Gnip (Gnip is not responsible for third-party publishers who may deprecate their own feeds without warning to Gnip)
- Change that places more restrictive rate limits on any Supplier API endpoint than those which are currently documented
- Change that places more restrictive limitations on rules than those which are currently documented
- Change to the meaning of structured content within the data payload (Activity Streams format only)
- Change that removes elements or attributes (in the case of XML) from the data payload (Activity Streams format only), so long as such changes are within the control of Gnip (Gnip is not responsible for third-party publishers who may remove elements or attributes from their data payload without warning to Gnip)
Important Notes (PLEASE READ):
Elements or attributes MAY be added to the data payload in the Activity Streams format without prior notification. Furthermore, Gnip may introduce new system messages into the data stream without providing notice to the customer prior to such changes.
Your application MUST be tolerant of additive changes such as new elements or attributes being put into the payload, and new system messages. Many libraries and approaches will handle this without issue, but some do strict checking of the structure of a document and can thus cause issues without modification. Failing to implement tolerance for these types of changes can result in your code being broken.
In addition, content-oriented elements of the Activity Streams payload such as entry/title or entry/activity:object/content must be treated as opaque strings. In other words, parsing the content of a single element for structure to ascertain meaning may break your code, because that content can change in the future without notice.
Instead, most of the elements you might want to get to will already be pulled out into more structural elements such as the user in the entry/activity:actor element or the tags in the entry/category elements.
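The tolerance requirement described above can be sketched in Python. This is only an illustration: the payloads are made up, and the field names (`id`, `actor`, `displayName`, `body`, `newFieldAddedLater`) and the `info` system message are assumptions, not a documented Gnip format. The idea is to read only the fields you need, ignore anything unfamiliar, and never parse the content string for structure.

```python
import json

def extract_fields(raw_line):
    """Pull out only the fields we need; ignore anything else in the payload."""
    doc = json.loads(raw_line)
    if "id" not in doc:
        # Not an activity (for example, a system message we do not recognize).
        return None
    actor = doc.get("actor", {})
    return {
        "id": doc["id"],
        "author": actor.get("displayName"),
        "text": doc.get("body"),  # kept as an opaque string, never parsed
    }

# Hypothetical payloads: the second one carries fields added after the code was written.
known = '{"id": "a1", "actor": {"displayName": "Ann"}, "body": "hello"}'
extended = ('{"id": "a2", "actor": {"displayName": "Bob", "extra": 1}, '
            '"body": "hi", "newFieldAddedLater": true}')
system_message = '{"info": {"message": "hypothetical notice"}}'

for line in (known, extended, system_message):
    print(extract_fields(line))
```

Because the function uses key lookups rather than validating the whole document against a fixed schema, the added fields and the unknown message type pass through without raising an error.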
Activity: A single event, action or result in the API response payload. Examples include a single Tweet, Facebook post, etc. See here for more information.
API (Application Programming Interface): The means through which software components interact with one another. Gnip interacts with various social media sources’ APIs to collect social media data. Gnip then normalizes the content and delivers it to customers through its own APIs.
Profile Location: The “Location” a user sets in their Twitter profile, intended to represent their “home” location. This is a plain string passed to Gnip by Twitter which can be anything the user wants to enter. This could be an actual location (“Denver, CO”) or something more ambiguous (“Planet Earth”). This is also the input for Gnip’s Profile Geo enrichment.
Backfill: A Gnip reliability product that helps ensure full data fidelity in the event of a short disconnection from PowerTrack and Firehose streams. If you reconnect to the stream within five minutes of being disconnected, you can resume from the point where you left off. Data missed during disconnections lasting longer than five minutes can be recovered using Replay or Historical PowerTrack.
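As a rough sketch of the Backfill rule in this entry, your app might choose a recovery path from the length of the disconnection. The function name and return values below are hypothetical, and treating a disconnection of exactly five minutes as still eligible for Backfill is an assumption.

```python
from datetime import timedelta

# Window taken from the Backfill definition; the boundary handling is assumed.
BACKFILL_WINDOW = timedelta(minutes=5)

def recovery_method(disconnect_duration):
    """Return which recovery path applies for a given disconnection length."""
    if disconnect_duration <= BACKFILL_WINDOW:
        return "backfill"
    return "replay_or_historical_powertrack"

print(recovery_method(timedelta(minutes=2)))
print(recovery_method(timedelta(minutes=12)))
```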
Carriage Return: Character used in Gnip’s PowerTrack, Firehose, Historical PowerTrack, and other APIs to both delimit activities, and provide a keep-alive signal to prevent streaming API connections from timing out when there is no new social data to be delivered.
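A minimal sketch of how a client might handle this delimiter follows. It assumes each activity is a JSON document terminated by a carriage return plus line feed, and that a delimiter with nothing before it is a keep-alive; check the documentation for your specific stream before relying on either assumption. The function name and sample chunks are made up for illustration.

```python
import json

DELIMITER = "\r\n"  # assumed delimiter: carriage return followed by line feed

def split_activities(chunks):
    """Yield parsed activities from an iterable of text chunks read off a stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A chunk may end mid-activity, so only consume complete lines.
        while DELIMITER in buffer:
            line, buffer = buffer.split(DELIMITER, 1)
            if line.strip():
                yield json.loads(line)
            # An empty line is a keep-alive signal: nothing to parse.

# Hypothetical chunks: one activity, a keep-alive, and an activity split across reads.
chunks = ['{"id": 1}\r\n\r\n{"i', 'd": 2}\r\n', '\r\n']
print(list(split_activities(chunks)))
```

Buffering until a full delimiter arrives matters because network reads do not align with activity boundaries.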
Complete Access: Through partnerships with certain sources that grant Gnip full fidelity to the data, Gnip has Complete Access to Twitter, WordPress, Disqus, and IntenseDebate. The data available is “public” data, meaning that no data restricted or made private by the publisher’s end users is delivered.
Data Cache: When referenced on this site, Data Cache refers to the feature of Gnip’s Data Collector for public APIs that stores any data collected by its feeds for the current day plus three additional days (UTC time, calendar days). This cache allows you to backfill the data collected during this rolling window, using GET requests, for time periods where your client initially failed to consume the data from the Data Collector.
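To make the rolling window concrete, here is a small sketch that computes the earliest instant a client could expect to find in the cache. It assumes "three additional days" means the three calendar days before the current UTC day; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def cache_window_start(now):
    """Midnight UTC three calendar days before the current UTC day (assumed reading)."""
    now_utc = now.astimezone(timezone.utc)
    today_midnight = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return today_midnight - timedelta(days=3)

now = datetime(2014, 3, 13, 15, 30, tzinfo=timezone.utc)
print(cache_window_start(now).isoformat())  # 2014-03-10T00:00:00+00:00
```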
Decahose: A 10% sample of all Tweets going through the Firehose. This is typically useful in cases where the customer does not know ahead of time what they are looking for and intends to identify interesting content as they go (e.g. trend identification). This is one of Gnip’s Firehose streams.
Enrichments: Additional data added to an activity’s payload by Gnip to make it more useful or easier to use. This includes: Profile Geo, basic and premium Klout, Expanded URLs, and Language Detection. These Enrichments must be enabled by your account manager and may involve additional costs.
Expanded URLs: An added Gnip Enrichment that attempts to provide the final destination URL (expanded URL) of a link that has been shortened by a link-shortening service.
Data Collector: Gnip’s product that provides managed access to public APIs by sending requests to those APIs on your behalf, and delivering the results to your app in a normalized format, and in the delivery protocol of your choice. Rather than building and maintaining connections to numerous public APIs, just connect to the Data Collector and let us handle the upstream connections and normalization.
Firehose: An unfiltered stream of data from one of Gnip’s complete access data sources. This can be a full firehose (every single activity), sampled (e.g. decahose), or one of our other options. This data is sent directly to you in real time, but does not support additional (e.g. keyword) filtering.
Gnip Console: Gnip’s web interface that allows you to view and manage your products and streams.
Historical Access: Historical Access refers to social data that is not delivered as it is happening in realtime. APIs that provide historical access include Historical PowerTrack, Search API, and Replay.
Historical Job: A single Historical PowerTrack request for historical data, defined by a particular set of PowerTrack rules and a specific historical time range which are applied to Gnip’s historical archive to deliver the matching data.
JSON: A data format used in Gnip streams, specifically in Complete Access sources. See http://json.org for information, and the documentation for specific complete-access sources’ data formats for sample payloads.
Metadata: Data about the data. This would include things like: the publish time of an activity, the location of an activity, information about the user publishing the activity, Gnip’s data enrichments, etc.
Operators: Special functions which represent the building blocks of a PowerTrack filtering Rule. Operators define the types of activities you want to receive through a stream (e.g. has:links, which returns only Tweets containing links).
Payload: Generic term for the raw results obtained from one of Gnip’s APIs. In some cases, this refers specifically to all of the metadata associated with a single social data “activity”, e.g. a Tweet. However, it can also refer to raw data like the response delivered to indicate the current status of a running Historical PowerTrack job. For Gnip APIs, these may be in XML or JSON format.
Polling: A method of retrieving data from an API in which a client app sends consistent periodic requests for batches of recent data.
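A polling client can be sketched as a loop that requests a batch, processes it, and waits before asking again. The sketch below is generic rather than tied to any Gnip endpoint: `fetch_batch` stands in for whatever HTTP request your app makes, and all names and the 60-second interval are hypothetical.

```python
import time

def poll(fetch_batch, handle, interval_seconds, max_polls, sleep=time.sleep):
    """Call fetch_batch() at a fixed interval and pass each activity to handle()."""
    for attempt in range(max_polls):
        for activity in fetch_batch():
            handle(activity)
        if attempt < max_polls - 1:
            sleep(interval_seconds)  # wait before the next request

# Stand-in for real HTTP requests: three canned batches, one of them empty.
batches = iter([["activity-1", "activity-2"], [], ["activity-3"]])
received = []
poll(lambda: next(batches), received.append,
     interval_seconds=60, max_polls=3, sleep=lambda seconds: None)
print(received)
```

The `sleep` parameter is injectable so the loop can be exercised without actually waiting.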
PowerTrack: Gnip’s product that offers an HTTP stream of real-time, filtered social data from a number of sources delivered to you in JSON format.
Profile Geo: A Gnip data enrichment that attempts to provide geocoding and normalization to the Profile Location (see above) provided by a Twitter user.
Public Access: Public Access data sources are social data providers (e.g. Facebook) which provide a public API with which to access social data. These data sources are accessed through Gnip with the Data Collector.
Public API: An API offered by a data source that gives the public access to its information and data. These APIs commonly require a developer account and corresponding access key to retrieve data from them.
Sources: Social data providers (e.g. Twitter, Facebook).
Streaming HTTP / Streaming Data Delivery: In streaming data delivery, data is continuously flowing to your app over a single persistent connection, without the need for repeated requests for batches of data. This is the model used by PowerTrack and Firehose streams, and is a delivery option for Data Collector feeds.
Rate Limits: The number of times a request can be made to an API in a specified amount of time. Most public APIs limit the number of requests that can be sent to them, which limits the ability to obtain full fidelity from such sources.
Redundant Streams: A second connection to a PowerTrack or Firehose stream. This delivers the same data as the original connection, allowing you to connect from two servers and guard against loss of data.
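Because both connections deliver the same data, an app that consumes from both will usually want to drop duplicates. One possible approach, assuming each activity carries a unique `id` field (an assumption for illustration), is to track the identifiers already seen:

```python
def deduplicate(activities, seen_ids=None):
    """Yield each activity once, keyed on its (assumed) unique "id" field."""
    seen = set() if seen_ids is None else seen_ids
    for activity in activities:
        activity_id = activity["id"]
        if activity_id in seen:
            continue  # already delivered by the other connection
        seen.add(activity_id)
        yield activity

# Hypothetical data from two connections with one overlapping activity.
primary = [{"id": "a"}, {"id": "b"}]
secondary = [{"id": "b"}, {"id": "c"}]
merged = list(deduplicate(primary + secondary))
print([activity["id"] for activity in merged])
```

In a long-running consumer the set of seen identifiers would need to be bounded, for example by expiring old entries.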
Replay: A recovery tool for when a connection goes down. Replay can go back in time to a certain moment and “relive” the stream as it happened in order. Replay is available for Twitter.
Rules: The syntax through which you define the activities to be delivered by a Gnip API. For public API sources, the available options for rules depend on what types of queries the source API supports. For complete-access sources, Gnip customers can use PowerTrack rules to filter data.
Search API: An API for delivering batches of data from the last 30 days in response to a query.
Substring: A string that is part of a longer string. For example, the substring “colo” is present in both “Colorado” and “Color”. Many PowerTrack Operators offer substring matching on various fields (e.g. contains:).
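The example in this entry can be reproduced with a short check. The match is written as case-insensitive because “colo” is said to match “Colorado”; that behavior is inferred from the example, and the function name is hypothetical rather than part of any Gnip API.

```python
def contains_substring(field_value, substring):
    """Case-insensitive substring check (case handling inferred from the example)."""
    return substring.lower() in field_value.lower()

print(contains_substring("Colorado", "colo"))  # True
print(contains_substring("Color", "colo"))     # True
print(contains_substring("Denver", "colo"))    # False
```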
Tags: Tags are a means of identifying rules and the social activities they return with a logical grouping or internal ID. When an activity matches a rule that was created with a tag, the delivered payload will include a reference to that tag, enabling easy associations and processing (Activity Streams format only).
User Mention Stream: A Firehose stream which delivers all Tweets from the Twitter firehose which mention a Twitter user.
UUID (Universally Unique Identifier): A character string used to uniquely identify an object. Within Gnip, a UUID is used to identify individual Historical PowerTrack jobs.
XML: A markup language / data format that is both human-readable and machine-readable, and which is utilized for Activity Streams format in Data Collector streams.