  • Data Collector Video Walkthrough

    posted on 24 March 2014 by Leah Barren


    Just getting started with the Data Collector? The following video provides an overview of the Data Collector and how to use it to get started using public API data.


    Step-by-step Walkthrough

    • Step 1 - Get an Account
    • Step 2 - Add a Data Stream
    • Step 3 - Add Rules to Your Stream
    • Step 4 - Retrieve Your Data

    Step 1 - Get an Account 

    When your Data Collector is activated, your Gnip account representative will send you an email with the URL that hosts your collector. Navigate to the URL in a browser and click the “Get Account” link below the login form as shown in the screenshot.

    Note that this instance supports only one set of user credentials, so be sure to use a username and password that you feel comfortable sharing with your team.

    After successfully creating your account, you will be able to log into the main dashboard and begin adding streams.


    Step 2 - Add a Data Stream 

    When you first arrive on the dashboard, there will not be any data streams set up. Click on the “Start Adding Feeds” button.

    Next, you will see the full list of feeds available to you. To select a feed, click on a publisher, and then choose a specific feed for that publisher as shown below. Different types of feeds serve different types of use cases, and correspond to different types of queries and API endpoints on the publisher’s source API.

    Note that some of these are premium sources which deliver data from Gnip’s firehose sources. These are only available in Gnip’s premium console, and selecting them will instruct you to contact your account representative for access.

    Configuring Your Feed

    After selecting your feed, you will need to configure it with any required parameters before it begins collecting data. This includes adding at least one rule.

    Each feed requires specific types of rules, and using the wrong type will result in improper data collection (for example, supplying keywords to a feed that expects user IDs). Pay close attention to the type of feed you are selecting, and consult the Data Collector – Feed Info document provided by your account representative, which indicates the proper inputs and outputs for each feed.

    Get Data

    Under “Get Data” –> “Advanced Settings” you can also configure how often your feed queries the source API for data (the “query rate”). Be sure that it is not configured to go beyond the publisher’s rate limits. If you do get rate limited, set this to something less frequent until the rate limiting stops.

    Output Format

    Choose between the publisher’s native data format, and Gnip’s Activity Streams format (XML for all Data Collector feeds).

    Default Delivery Method

    This is the method you will use to retrieve the data collected by your feed and pull it into your system. Choose between a Streaming HTTP connection, where we push the data through a persistent connection; GET, where you poll the feed using GET requests on an ongoing basis; or POST, where the feed posts incoming data to a URL of your choosing.

    Access Token/API Key

    Some sources require an Access Token or API key to retrieve data from them. These feeds will have an extra form field where you can enter your token.

    Once you have configured everything, click “Create” at the bottom of the page.


    Step 3 - Add Rules to Your Stream 

    When you initially configured your stream, you chose some rules to initialize the stream with (where applicable). However, beyond that, you’ll want to add and remove additional rules to retrieve appropriate content from the source. There are two ways to manage rules on your Data Collector streams.

    Via the UI

    Navigate to the “Edit Settings” page of your stream in the dashboard. The text box shown on this page will display all of the rules currently in place for your stream. You can add additional rules (one per line), or modify the existing set and then click “Update” to make changes.

    This method is provided as a way to get started and test the functionality of the stream. However, the text box and UI-based rule management are not supported once your stream exceeds 1000 rules. In production settings, your app should manage rules via the API.

    Via the API

    On your dashboard, navigate to the API Help page for one of your filtered streams and scroll down to the “Manage your Rules” section of the page. There, you’ll see two URLs corresponding to the Rules API endpoints for this specific feed. Use the “.xml” endpoint if you will be uploading XML-formatted rules, or the “.json” endpoint if you will be uploading JSON-formatted rules. The URL should look something like:

    “https://XXXX.gnip.com/data_collectors/##/rules.xml”

    To test adding a rule via the API, you can use a simple curl command similar to the following examples. Substitute your own username and Rules API URL in the actual requests.

    JSON:

    curl -v -X POST -uXXXXXX "https://XXXXX.gnip.com/data_collectors/##/rules.json" -d '{"rules":[{"tag":"tests","value":"testrule"}]}'

    XML:

    curl -v -X POST -uXXXXXX "https://XXXXX.gnip.com/data_collectors/##/rules.xml" -d '<rules><rule tag="tests"><value>rule_with_tag</value></rule><rule><value>rule_without_tag</value></rule></rules>'

    If it returns a “201 Created” response, you were successful. You can also verify this by navigating to the Edit page of the stream, and checking the text box shown there.
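
    The same JSON request can be issued from an application. The following is a minimal Python sketch, assuming the payload shape shown in the JSON curl example and HTTP basic authentication as implied by curl's -u option. The function names build_rules_payload and add_rules are hypothetical, as is the idea that a JSON rule may omit its tag; the Rules API URL, username, and password are your own values.

```python
import base64
import json
import urllib.request


def build_rules_payload(rules):
    """Build the JSON body for the Rules API from (value, tag) pairs.

    A tag of None produces a rule with no "tag" key (an assumption).
    """
    items = []
    for value, tag in rules:
        rule = {"value": value}
        if tag is not None:
            rule["tag"] = tag
        items.append(rule)
    return json.dumps({"rules": items})


def add_rules(rules_url, username, password, rules):
    """POST rules to the feed's ".json" Rules API URL and return the HTTP status."""
    request = urllib.request.Request(
        rules_url,
        data=build_rules_payload(rules).encode("utf-8"),
        method="POST",
    )
    # Basic authentication header, equivalent to curl's -u option.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    # No Content-Type is set explicitly, so urllib sends the same default
    # form-encoded type that curl -d uses.
    with urllib.request.urlopen(request) as response:
        return response.status  # 201 means the rules were created
```

    Calling build_rules_payload([("testrule", "tests")]) produces a body equivalent to the one in the JSON curl example, and add_rules returns the status code so your app can check for 201.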

    The API also supports methods for deleting and listing rules for your stream. See the full Data Collector Documentation for details.


    Step 4 - Retrieve Your Data 

    After you add your feed (or feeds), each one automatically starts collecting data at the query rate you set up. The next step is to retrieve some of the data that is being collected. This is generally accomplished via the Default Delivery Method you chose for a specific feed.

    For more detail on this step, see our documentation.

    Option 1: Polling with GET Requests

    To poll a feed for data using GET requests, navigate to the API Help page of the feed and retrieve the Polling URL as shown in the screenshot below.

    Send a GET request to this URL to retrieve some of the data currently stored in the cache. To test this functionality using curl, you can execute a curl command similar to the following in a command line prompt:

    curl -v -uXXXXXX "https://XXXXX.gnip.com/data_collectors/##/activities.xml"

    By default, polling this URL will give you the most recent 100 results collected by the feed. However, you can specify optional parameters (since_date, to_date, max) in the URL to make more targeted polling requests.
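
    As an illustration of targeted polling requests, the sketch below appends the optional parameters to a Polling URL. The helper name build_polling_url and its max_results argument are hypothetical. The value format expected for since_date and to_date is not covered here, so the sketch passes those values through unchanged.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_polling_url(polling_url, since_date=None, to_date=None, max_results=None):
    """Return polling_url with any of since_date, to_date, and max added."""
    parts = urlsplit(polling_url)
    # Keep any query parameters already present on the Polling URL.
    query = parse_qsl(parts.query, keep_blank_values=True)
    optional = (
        ("since_date", since_date),
        ("to_date", to_date),
        ("max", max_results),
    )
    for name, value in optional:
        if value is not None:
            query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

    Copy the Polling URL from the API Help page and pass it in whole. Parameters left as None are omitted, so the default behaviour applies to them.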

    RefreshURL

    Configure your app to repeat these requests and retrieve the data as it is being collected. To ensure that your app retrieves only newly collected results from the feed’s data cache, use the refreshURL element provided in each response as the Polling URL for your next request. The since_date in this URL will correspond to the time of your last retrieval.
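
    A polling cycle that follows the refreshURL could be sketched as below. The exact position of refreshURL in the response XML is an assumption: the sketch accepts either an attribute or an element with that name and ignores namespaces. The names find_refresh_url and poll_once are hypothetical, and fetch and handle are callables your app supplies.

```python
import xml.etree.ElementTree as ET


def find_refresh_url(response_xml):
    """Return the refreshURL value from a polling response, or None if absent.

    Assumption: the value is either an attribute named refreshURL or the
    text of an element named refreshURL.
    """
    root = ET.fromstring(response_xml)
    for element in root.iter():
        if "refreshURL" in element.attrib:
            return element.attrib["refreshURL"]
        local_name = element.tag.rsplit("}", 1)[-1]  # drop any namespace
        if local_name == "refreshURL" and element.text:
            return element.text.strip()
    return None


def poll_once(url, fetch, handle):
    """Fetch one batch, process it, and return the URL for the next request.

    fetch(url) returns the response body as text; handle(body) stores or
    processes it. If no refreshURL is found, the same URL is reused.
    """
    body = fetch(url)
    handle(body)
    return find_refresh_url(body) or url
```

    An app would call poll_once in a loop, passing the returned URL into the next call and pausing between requests according to its own polling schedule.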

    Option 2: Streaming HTTP Connection

    Establishing a streaming connection requires the same type of GET request as with polling, but the response body will continue indefinitely as the data collector continues to push new data through it. Simply execute a GET request to the Streaming URL shown on your API Help page, and watch as data is pushed through the connection. No since_date, to_date, or max parameters are required (or allowed) for streaming connections.

    If you are disconnected from the feed for any reason, the data that would have been pushed through the connection while you were disconnected will not be re-sent. Thus, it is important to configure your app to reconnect immediately upon being disconnected. This can generally be accomplished with a simple loop.
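
    The reconnect loop mentioned above might look like the following sketch. The name stream_with_reconnect and its parameters are hypothetical. open_stream is any callable that opens the Streaming URL and returns an iterable of chunks, for example a function that wraps urllib.request.urlopen with your credentials.

```python
import http.client
import time


def stream_with_reconnect(open_stream, handle_line, max_attempts=None,
                          pause_seconds=0.0, sleep=time.sleep):
    """Consume a streaming connection, reconnecting whenever it drops.

    max_attempts=None reconnects indefinitely; a number stops after that
    many connections, which is mainly useful for testing. pause_seconds
    defaults to 0 so the reconnect is immediate.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            for line in open_stream():
                handle_line(line)
        except (OSError, http.client.HTTPException):
            pass  # connection failed or dropped; fall through and reconnect
        sleep(pause_seconds)
    return attempts
```

    The loop also reconnects when the stream ends without an error. Data pushed while the connection was down is not delivered by this loop and still has to be retrieved from the cache with GET requests.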

    Once you are reconnected, you can use GET requests in parallel to retrieve any data in the cache that you might have missed. See our documentation regarding the Data Cache and Backfilling for more information.

    Option 3: Outbound POST

    If you configured your feed to POST data, you should have entered the URL the data will be posted to, and if necessary, basic authentication credentials to be sent with the POST requests. You will need to have a server set up to listen for the data at the URL you specified in your feed. As with the Streaming HTTP connection, if anything prevents a POST request from being delivered to your URL, the POST request will not be retried, and you will need to backfill the missing data using parallel GET requests as described in our documentation.
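
    For testing, a listener for the posted data can be sketched with Python's standard library. This is a minimal illustration rather than a production server. It assumes each POST carries a Content-Length header and it does not check basic authentication credentials. The names make_handler and make_server, and the default host and port, are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(on_data):
    """Create a request handler class that passes each POST body to on_data."""

    class CollectorPostHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read exactly the number of bytes announced by the sender.
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            on_data(body)
            self.send_response(200)
            self.end_headers()

        def log_message(self, format, *args):
            pass  # silence the default per-request logging

    return CollectorPostHandler


def make_server(on_data, host="127.0.0.1", port=8080):
    """Return an HTTP server that hands every POST body to on_data."""
    return HTTPServer((host, port), make_handler(on_data))
```

    Calling make_server(my_callback).serve_forever() starts the listener, where my_callback is your own function for storing the data. The URL entered in the feed configuration must point to the host and port the server is bound to.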