Downloading Historical PowerTrack Files

posted on 20 May 2014 by Jim Moffitt


Introduction  

Historical PowerTrack (HPT) provides an API wrapper around a process that filters and outputs data from a tweet archive. Data are gathered in a multi-step process where the first step is creating a ‘job’ which specifies what time period to collect from and what filters to apply to the tweet archive.

The size of data returned by a job can be immense, both in the number of activities and in the storage size of the output payload. Jobs can produce millions of tweets, requiring large amounts of storage space. To help keep individual files quick to download, data files are generated as a time series, with each file covering a ten-minute period of the job’s time frame. Depending on the data volumes associated with the job’s filters, even these 10-minute files can contain many thousands of tweets.

Since each hour of the job’s time period can generate up to 6 files (10-minute periods can be ‘silent’ and have no data associated with them), the number of data files generated by an HPT job can be large. For example, a 90-day job can produce up to 12,960 files, one for each 10-minute period of those 90 days.

The data files that are generated are hosted at Amazon’s Simple Storage Service (S3) and are available for 15 days. These files are gzip-compressed JSON files and use the UTF-8 character set. All timestamps, whether used in a job description, included in API responses, used in filenames, or included in the returned tweet data, are in UTC.

When a job is complete, a list of download links is provided via the HPT API. Given that this list can contain thousands of links, some form of download automation is needed to retrieve the data.

Below we discuss some fundamental details of the generated files, show how to access your download link list, provide some options for automating the downloads, and cover other technical details that will hopefully be helpful when working with HPT data files.

Historical PowerTrack 2.0 and updated URLs  

With Historical PowerTrack 2.0, the endpoint URLs have been updated. With version 2.0, there are two changes:

  • The root host domain has been updated from historical.gnip.com to gnip-api.gnip.com.
  • The URL path has changed from /accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}/ to /historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/

Here are the version 1.0 URLs:

  • Individual Job Status: https://historical.gnip.com/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}.json
  • Download links as a JSON array: https://historical.gnip.com/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}/results.json
  • Download links as a CSV: https://historical.gnip.com/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}/results.csv

Here are the version 2.0 URLs:

  • Individual Job Status: https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json
  • Download links as a JSON array: https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.json
  • Download links as a CSV: https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.csv

As discussed HERE, the HPT API provides a method to check on the status of a job. When an HPT job is created it is assigned a Universally Unique ID (UUID), and this UUID is used to check on the status of the job of interest. To get the status of a job, you make a GET request to the following HPT endpoint, which references your account name and the UUID of interest:

https://historical.gnip.com/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}.json

Note that if you do not have the UUID you can make the following GET request to receive a list of all the jobs that have been submitted:

https://historical.gnip.com/accounts/{ACCOUNT_NAME}/jobs.json
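Requests to these endpoints are authenticated with your console.gnip.com credentials using HTTP Basic Authentication. For example, a minimal job status request with cURL looks like this (with <consoleuser>, <password>, {ACCOUNT_NAME}, and {JOB_UUID} as placeholders; if you are on Historical PowerTrack 2.0, simply swap in the 2.0 URL shown above):

curl -sS -u<consoleuser>:<password> https://historical.gnip.com/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}.json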

When requesting the status of a specific job, the HPT API will respond with a JSON response body that includes “status” and “percentComplete” fields.

{
    "title": "WorldCup test",
    "account": "CustomerName",
    "publisher": "twitter",
    "streamType": "track",
    "format": "activity_streams",
    "fromDate": "201405171800",
    "toDate": "201405171900",
    "requestedBy": "me@there.com",
    "requestedAt": "2014-05-19T22:12:27Z",
    "status": "running",
    "statusMessage": "Job queued and being processed.",
    "jobURL": "https://historical.gnip.com:443/accounts/CustomerName/publishers/twitter/historical/track/jobs/kj4r26m0qx.json",
    "quote": {
        "estimatedActivityCount": 12000,
        "estimatedDurationHours": "1.0",
        "estimatedFileSizeMb": "5.0",
        "expiresAt": "2014-05-26T22:15:58Z"
    },
    "acceptedBy": "me@there.com",
    "acceptedAt": "2014-05-19T22:18:45Z",
    "percentComplete": 95
}

When a job has completed, the “status” field will be set to “delivered” and “percentComplete” will equal 100. Responses for completed jobs are also updated with a “results” section that indicates the number of activities that were compiled and the number of files that were generated. The “results” section also includes a “dataURL” field, which is a separate link that provides the list of download links.

{
    "title": "WorldCup test",
    "account": "CustomerName",
    "publisher": "twitter",
    "streamType": "track",
    "format": "activity_streams",
    "fromDate": "201405171800",
    "toDate": "201405171900",
    "requestedBy": "me@there.com",
    "requestedAt": "2014-05-19T22:12:27Z",
    "status": "delivered",
    "statusMessage": "Job delivered and available for download.",
    "jobURL": "https://historical.gnip.com:443/accounts/CustomerName/publishers/twitter/historical/track/jobs/kj4r26m0qx.json",
    "quote": {
        "estimatedActivityCount": 12000,
        "estimatedDurationHours": "1.0",
        "estimatedFileSizeMb": "5.0",
        "expiresAt": "2014-05-26T22:15:58Z"
    },
    "acceptedBy": "me@there.com",
    "acceptedAt": "2014-05-19T22:18:45Z",
    "results": {
        "activityCount": 11860,
        "fileCount": 6,
        "fileSizeMb": "5.07",
        "completedAt": "2014-05-19T22:30:13Z",
        "dataURL": "https://historical.gnip.com:443/accounts/CustomerName/publishers/twitter/historical/track/jobs/kj4r26m0qx/results.json",
        "expiresAt": "2014-06-03T22:29:46Z"
    },
    "percentComplete": 100
}
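If you are polling for job completion from a script, the fields of interest can be pulled straight out of this response. Here is a quick sketch that prints the “status”, “percentComplete”, and “dataURL” values; it assumes the jq command-line JSON utility is installed (jq is not required by the HPT API, it is simply one convenient option):

curl -sS -u<consoleuser>:<password> https://historical.gnip.com/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}.json | jq -r '.status, .percentComplete, .results.dataURL'

Note that the last value will print as “null” until the job has been delivered and the “results” section appears in the response.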

To access the download list, a GET request is made to the Data URL endpoint. The form of this Data URL is:

https://historical.gnip.com:443/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}/results.json

The HPT API will respond with a JSON response body that includes a “urlList” field, which is a JSON array of all the download links associated with your job.

{
    "urlCount": 6,
    "urlList": [
        "https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/00_activities.json.gz?...",
        "https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/10_activities.json.gz?...",
        "https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/20_activities.json.gz?...",
        "https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/30_activities.json.gz?...",
        "https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/40_activities.json.gz?...",
        "https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/50_activities.json.gz?..."
    ],
    "expiresAt": "2014-06-03T22:29:46Z",
    "totalFileSizeBytes": 5324200
}
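To eyeball the individual links, or to feed them to another tool, the “urlList” array can be printed one link per line. Again, a quick sketch assuming jq is installed:

curl -sS -u<consoleuser>:<password> https://historical.gnip.com/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}/results.json | jq -r '.urlList[]'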

Note that the download list (not the actual data!) is also available in a CSV format by replacing the “.json” extension of the Data URL with a “.csv” extension, as in:

https://historical.gnip.com:443/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}/results.csv

When a GET request is made to the CSV endpoint, a file containing the list of download links is downloaded. Each line contains a file name and the corresponding Amazon S3 link. This CSV-formatted link list is used by the example cURL commands provided below.

'20140517-20140517_kj4r26m0qx_2014_05_17_18_00_activities.json.gz'	'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/00_activities.json.gz?...'
'20140517-20140517_kj4r26m0qx_2014_05_17_18_10_activities.json.gz'	'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/10_activities.json.gz?...'
'20140517-20140517_kj4r26m0qx_2014_05_17_18_20_activities.json.gz'	'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/20_activities.json.gz?...'
'20140517-20140517_kj4r26m0qx_2014_05_17_18_30_activities.json.gz'	'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/30_activities.json.gz?...'
'20140517-20140517_kj4r26m0qx_2014_05_17_18_40_activities.json.gz'	'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/40_activities.json.gz?...'
'20140517-20140517_kj4r26m0qx_2014_05_17_18_50_activities.json.gz'	'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/CustomerName/2014/05/19/20140517-20140517_kj4r26m0qx/2014/05/17/18/50_activities.json.gz?...'

Automating Downloads  

As mentioned above, this list can contain thousands of links. Therefore, the downloading needs to be automated, and there are several ways to do this.

Download Files with cURL

cURL is a handy command-line utility for making HTTP requests. It is so useful that we provide sample cURL commands on the “API Help” tab of the console.gnip.com dashboard, as well as many examples in our support.gnip.com documentation. cURL is a great tool for exercising our many API-based products, including Historical PowerTrack.

If you are working with Linux or Mac OS, cURL is most likely already available on your machine. If you are working on Windows, see HERE for a recipe for getting cURL installed. Also for Windows users, since the cURL examples below are built with Unix/Linux commands, see HERE for getting a Linux emulator set up on your Windows box.

The following cURL command downloads all of the results files in parallel into the current directory. It downloads the results.csv file (silently) and then downloads each file listed in it, printing out the command it will use as it does so. The command uses the file name from the first column of results.csv as the local file name. When using this command, be sure to use the CSV results file, rather than the JSON results file.

curl -sS -u<consoleuser>:<password> https://historical.gnip.com/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}/results.csv | xargs -P 8 -t -n2 curl -o

If you have any issues with the above command (permission issues, for example), it can be helpful to test with the following cURL command, which downloads only the first file.

curl -sS -u<user>:<password> https://historical.gnip.com/accounts/{ACCOUNT_NAME}/publishers/twitter/historical/track/jobs/{JOB_UUID}/results.csv | head -1 | xargs -P 8 -t -n2 curl -o

While these cURL commands are very convenient, they do have a disadvantage. If the download cycle is interrupted for any reason, these commands have no mechanism to resume where the previous cycle stopped. Therefore, the following download tools may be more appropriate for your situation.

Downloading Files with a Bash Script

This bash script is another option for downloading Historical PowerTrack data files. One advantage of using this script is that it has ‘smarts’ when restarting a download cycle. If your download cycle gets interrupted for any reason, the script will inspect what files it has already downloaded and download only the files that are not available locally. The script writes files to a ‘downloads’ folder, so it is important to keep files there until all the files have been downloaded.
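If you would rather see the basic idea before digging into the script itself, here is a rough sketch of that ‘only download what is missing’ logic (an illustration of the approach, not the linked script). It assumes the CSV link list has been saved locally as results.csv and that cURL is available:

    #!/usr/bin/env bash
    # Sketch of a restartable download loop: skip any file that already exists locally.
    # Assumes results.csv contains lines of the form: 'filename'<TAB>'S3 URL'
    mkdir -p downloads
    while IFS=$'\t' read -r name url; do
        # Strip the surrounding single quotes from each field.
        name=${name//\'/}
        url=${url//\'/}
        # Only download files we do not already have.
        if [ ! -f "downloads/$name" ]; then
            echo "Downloading $name"
            curl -sS -o "downloads/$name" "$url"
        fi
    done < results.csv

Because it only fetches the files that are not already in the ‘downloads’ folder, re-running this loop after an interruption simply picks up where the previous run left off.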

Developing a Custom Script/Application

You may decide to build your own script or application to manage HPT downloads. Perhaps you are moving up to an HPT Subscription and will need to download HPT data on a regular basis. If that is the case, here is a sketch in Ruby to illustrate the process, which also serves as a summary of the steps needed to download HPT data files:

    require 'json'
    require 'net/http'
    require 'uri'

    # Account and job details from the examples above; replace with your own values.
    account_name = "CustomerName"
    user_name    = "me@there.com"
    password     = "password"
    job_uuid     = "kj4r26m0qx"

    # Helper: HTTP GET with optional Basic Authentication, returning the response body.
    def http_get(url, user = nil, pass = nil)
      uri = URI(url)
      request = Net::HTTP::Get.new(uri)
      request.basic_auth(user, pass) unless user.nil?
      Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
        http.request(request).body
      end
    end

    # HTTP GET the completed job status and parse the JSON into a hash.
    job_url = "https://historical.gnip.com/accounts/#{account_name}/publishers/twitter/historical/track/jobs/#{job_uuid}.json"
    job_status = JSON.parse(http_get(job_url, user_name, password))

    # Grab the Data URL from the "results" section (assumes the job status is "delivered").
    data_url = job_status["results"]["dataURL"]

    # HTTP GET the Data URL, parse the JSON, and grab the URL download list.
    results = JSON.parse(http_get(data_url, user_name, password))
    file_list = results["urlList"]

    # Where to write files?
    out_box = "./"

    # Iterate through the list of download links.
    file_list.each do |item|
      # Parse the S3 link into a local file name.
      filename = out_box + ((item.split(account_name)[1]).split("/")[4..-1].join).split("?")[0]

      # If you don't have the file locally, go get it. The S3 links are pre-signed,
      # so the file downloads themselves do not need Basic Authentication.
      next if File.exist?(filename)
      File.open(filename, "wb") do |hpt_file|
        hpt_file.write(http_get(item))
      end
    end

Technical Details - Working with HPT Files  

Here are some high-level details that provide technical background on the Historical PowerTrack (HPT) product and the data files it generates. This information will help you work with the data files after you have downloaded them, or develop your own automation script or application.

  • Files are gzip compressed.
  • File contents are formatted in JSON and use the UTF-8 character set.
    • Individual activities are written as ‘atomic’ JSON objects, and are not placed in a JSON array (see the example after this list).
    • Each file has a single “info” footer:
      {"info":{"message":"Replay Request Completed","sent":"2014-05-15T17:47:27+00:00","activity_count":895}}
  • All timestamps in filenames and tweet metadata are in UTC.
  • Each HPT job has a Universally Unique ID (UUID) associated with it. This UUID is referenced when making API requests and is used to name the resulting files.
  • HPT generates a 10-minute time-series of files. A file is only generated if the ten-minute period it covers has activity.
  • Time periods are inclusive of their starting time and exclusive of the next period’s starting time. For example, the first hour of the day (00:00 - 01:00 UTC) would produce up to 6 files covering these 10-minute time periods:
    • 00:00:00-00:09:59 UTC
    • 00:10:00-00:19:59 UTC
    • 00:20:00-00:29:59 UTC
    • 00:30:00-00:39:59 UTC
    • 00:40:00-00:49:59 UTC
    • 00:50:00-00:59:59 UTC
  • Some planning numbers:
    • 6 files per hour.
    • 144 files per day.
    • 4,320 files per 30-day month.
    • 52,560 files per year.
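Once a file has been downloaded, its contents are easy to inspect from the command line. For example, here is a quick sketch that decompresses one of the 10-minute files from the example job above and counts the activities in it, skipping the single “info” footer object (this assumes the jq command-line JSON utility is installed; jq is not required, just convenient):

gunzip -c 20140517-20140517_kj4r26m0qx_2014_05_17_18_00_activities.json.gz | jq -c 'select(has("info") | not)' | wc -l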

File-naming Conventions

  • HPT file names are a composite of the following details:
    • Job start date, YYYYMMDD.
    • Job end date, YYYYMMDD.
    • Job UUID.
    • Starting time of 10-minute period, YYYYMMDDHHMM.
    • A static “activities” string.
    • File extension of “.json.gz” (gzip-compressed JSON files).
These elements combine into the following pattern:

<start_date>-<end_date>_{JOB_UUID}<10-min-starting-time>_activities.json.gz

For example, given a Job UUID of gv96x96q3a covering a period of 2014-05-16 to 2014-05-20, the first hour of 2014-05-17 would produce the following 6 files:

  • 20140516-20140520_gv96x96q3a201405170000_activities.json.gz
  • 20140516-20140520_gv96x96q3a201405170010_activities.json.gz
  • 20140516-20140520_gv96x96q3a201405170020_activities.json.gz
  • 20140516-20140520_gv96x96q3a201405170030_activities.json.gz
  • 20140516-20140520_gv96x96q3a201405170040_activities.json.gz
  • 20140516-20140520_gv96x96q3a201405170050_activities.json.gz
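Because the 10-minute starting time is embedded in each file name, simple shell globbing makes it easy to work with a whole day of files at once. For example, here is a sketch (using the example job above, and assuming all of its files have been downloaded into the current directory) that decompresses and concatenates every file for 2014-05-17 into a single JSON file. Keep in mind that the combined output will contain one “info” footer per input file:

gunzip -c 20140516-20140520_gv96x96q3a20140517*_activities.json.gz > 20140517_activities.json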