Historical PowerTrack provides access to the entire archive of public Twitter data, back to the first Tweet. It uses the same rule-based filtering system as the realtime PowerTrack stream to deliver complete coverage of historical Twitter data.
You access data through Historical PowerTrack by creating historical ‘jobs’. A job is a set of PowerTrack filtering rules plus a historical time frame for which you would like to retrieve matching data from the Twitter archive. Jobs are created and managed through the Historical PowerTrack API.
Each Historical PowerTrack job consists of the following steps.
1. Create a new job for a time frame and set of PowerTrack rules using a request to the Historical PowerTrack API. The time period of interest is specified by providing ‘fromDate’ and ‘toDate’ timestamps in the UTC time zone. Gnip will then sample the time period and generate a rough estimate of the expected data volume and the time required to complete the job.
2. Either accept or reject the job, based on the estimate generated in Step 1. This is done with another request to the API. If the expected volume or time required is outside the range you anticipated, you may want to reject the job. If the estimate is acceptable, accept the job, and it will be run to generate the data files.
3. When the job is complete, send a request to the API to retrieve the list of URLs that will be used to download the data files.
4. Using the list of URLs obtained in Step 3, download the Twitter data files. Each file represents a 10-minute segment of the overall job, and the files are available for download for 15 days after the job has been accepted.
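The inputs to the first step, and the timing facts from the last one, can be sketched in Python. This is a minimal illustration, not the API's request schema: the text above only confirms the ‘fromDate’ and ‘toDate’ fields in UTC, so the `YYYYMMDDHHMM` timestamp layout, the `title` and `rules`/`value` field names, and all function names below are assumptions. Check the API reference before relying on them.

```python
from datetime import datetime, timedelta, timezone

SEGMENT = timedelta(minutes=10)        # each data file covers a 10-minute segment
DOWNLOAD_WINDOW = timedelta(days=15)   # files are available 15 days after acceptance


def to_job_timestamp(moment):
    """Convert an aware datetime to UTC and format it (YYYYMMDDHHMM layout is assumed)."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime so the UTC conversion is unambiguous")
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M")


def build_job_request(title, rules, start, end):
    """Assemble a hypothetical job payload; only fromDate/toDate come from the text."""
    if end <= start:
        raise ValueError("toDate must be later than fromDate")
    return {
        "title": title,                                 # assumed field name
        "fromDate": to_job_timestamp(start),
        "toDate": to_job_timestamp(end),
        "rules": [{"value": rule} for rule in rules],   # assumed field names
    }


def segment_count(start, end):
    """Number of 10-minute segments needed to cover the job's time frame."""
    return -(-(end - start) // SEGMENT)  # ceiling division on timedeltas


def download_deadline(accepted_at):
    """Last moment the data files remain available for download."""
    return accepted_at + DOWNLOAD_WINDOW
```

A one-hour job spans six 10-minute segments. Whether a segment with no matching Tweets still produces a file is not stated here, so treat the segment count as a planning figure rather than an exact file count.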
These steps are structured so that the process is easy to integrate into your existing workflow and tools. Quoting and acceptance are separate steps because the price of a Historical PowerTrack job can be significant, and for many of our customers, those requesting data may not have the authority within their organization to spend this money.
Billing Note: Historical PowerTrack jobs are counted for billing based on the day they are “completed”. For example, a job that is accepted and starts running on August 31 and completes on September 1 counts towards the September billing quota.
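The billing rule can be expressed as a small Python helper. This is only a sketch of the rule as stated above; the function name is hypothetical, and the time zone used to decide which day a job completed on is not specified here.

```python
from datetime import date


def billing_period(accepted_on, completed_on):
    """Return the YYYY-MM period a job counts toward: the month it completed in."""
    if completed_on < accepted_on:
        raise ValueError("a job cannot complete before it is accepted")
    # The acceptance date plays no part in the result; only completion matters.
    return f"{completed_on.year:04d}-{completed_on.month:02d}"
```

For a job accepted on August 31 and completed on September 1, the helper returns the September period.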
Each Historical PowerTrack job supports up to 1000 PowerTrack rules, subject to the normal restrictions on the number of operators and characters in each rule.
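If a rule set exceeds the per-job limit, one option is to divide it across several jobs covering the same time frame. The Python sketch below shows that split; the function name is hypothetical, and it does not check the per-rule operator and character restrictions, which are not detailed here.

```python
MAX_RULES_PER_JOB = 1000  # per-job limit stated above


def split_rules_into_jobs(rules, limit=MAX_RULES_PER_JOB):
    """Group rules into batches that each fit within a single job."""
    rules = list(rules)
    return [rules[i:i + limit] for i in range(0, len(rules), limit)]
```

Each batch would then be submitted as its own job, so estimates, acceptance, and billing apply to each one separately.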
Historical PowerTrack supports the same rules and operators as realtime PowerTrack. Details about PowerTrack rules and operators are available here:
NOTE: Some operators are limited in specific time periods because of gaps in the source Twitter data for those periods.
If you plan to use any of the following operators, please see Caveats below for details.
Data for completed Historical PowerTrack jobs is delivered as an array of flat, gzip-compressed JSON files. Each file represents a 10-minute segment of the overall time period requested for the job.
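A downloaded file can be read with Python's standard library. The sketch below assumes each file holds one JSON object per line, which the text above does not confirm, so adjust the parsing if your files are laid out differently. The function name is hypothetical.

```python
import gzip
import json


def iter_activities(path):
    """Yield one parsed JSON object per non-empty line of a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Because the function is a generator, it processes one record at a time and does not load a whole segment into memory.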
Tweets delivered through Historical PowerTrack can be provided in either Gnip’s Activity Streams format, or in Twitter’s native format. For details on the Activity Streams format for Twitter, see our documentation here. For documentation on Twitter’s native data format, see their documentation here.
Gnip Enrichments are available in historical Twitter data from the dates listed below onward. Operators that rely on an enrichment are not supported for jobs with a time frame prior to that enrichment’s date of first availability.
|Enrichment|Date of first availability|
|Klout Topics (limited)|08/01/2013|
|Enhanced Expanded URLs (URL Title and Meta Description)|07/29/2016|
|Poll Metadata (Original Format only)|02/22/2017|
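Before submitting a job, you can compare its start date against the table. The Python sketch below copies the dates from the table (read as MM/DD/YYYY); the shortened dictionary keys and the function name are illustrative only.

```python
from datetime import date

# Dates of first availability, taken from the table above.
ENRICHMENT_FIRST_AVAILABLE = {
    "Klout Topics (limited)": date(2013, 8, 1),
    "Enhanced Expanded URLs": date(2016, 7, 29),
    "Poll Metadata": date(2017, 2, 22),
}


def unavailable_enrichments(job_start):
    """List enrichments that are not yet available at the start of the job's time frame."""
    return sorted(
        name
        for name, first_available in ENRICHMENT_FIRST_AVAILABLE.items()
        if job_start < first_available
    )
```

A job starting on January 1, 2014 would have Klout Topics available but neither Enhanced Expanded URLs nor Poll Metadata.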
Twitter’s Language Classification
Twitter’s native tweet-by-tweet language classification metadata is available in the archive beginning on March 26, 2013.
The url_contains operator will still function prior to 3/26/2012, but it will only match against URLs as they were entered by the user into a Tweet, not against the fully resolved URL. For example, if a bit.ly URL is entered in the Tweet, the operator can only match against the bit.ly URL and not the URL that bit.ly has shortened.
Native geo data prior to 9/1/2011 is not available in Historical PowerTrack. As a result, operators that rely on this geo data are not supported for jobs with a time frame prior to this date.
User Profile Data
All data prior to 1/1/2011 contains user profile information as it appeared in that user’s profile in September 2011. For example, @jack’s very first Tweet in March 2006 contains his bio data from September 2011, which references his position as CEO at Square, a company that did not exist at the time of the Tweet.
Followers and Friends Counts
All data prior to 1/1/2011 contains followers and friends counts equal to zero. As a result, any rules based on non-zero counts for these metadata will not return any results for a timeframe prior to this date.
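The date-based caveats above can be collected into a pre-flight check that reports which ones apply to a job. This Python sketch copies the cutoff dates from this section; the labels, message wording, and function name are illustrative, not part of the product.

```python
from datetime import date

# Cutoff dates from the caveats above; a caveat applies to data before its cutoff.
CAVEAT_CUTOFFS = {
    "language classification": (date(2013, 3, 26), "no native language metadata"),
    "url_contains": (date(2012, 3, 26), "matches only URLs as entered in the Tweet"),
    "native geo": (date(2011, 9, 1), "geo data and geo operators unavailable"),
    "user profile": (date(2011, 1, 1), "profile reflects September 2011"),
    "followers and friends counts": (date(2011, 1, 1), "counts are zero"),
}


def caveats_for(job_start):
    """Return the caveats that affect a job whose time frame starts on job_start."""
    return {
        label: note
        for label, (cutoff, note) in CAVEAT_CUTOFFS.items()
        if job_start < cutoff
    }
```

The check uses only the start of the time frame, so a job that begins before a cutoff and ends after it is still flagged, because part of its data is affected.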