
  • Data & Rule Management with jq

    posted on 24 September 2014 by Josh Montague


    If you’ve spent time working with Gnip APIs, then you’re already familiar with the formatting used for PowerTrack Rules, and possibly also the JSON data (normalized Activity Streams or original format) that our APIs can deliver. Having the right tools for working with JSON can save you a great deal of time. One of our favorite tools for working with this type of data is jq, a command-line JSON parser. This article is not intended to serve as a tutorial on using jq - the official documentation already does a good job of that - but to provide some tips and pointers for using jq in tasks related to interacting with the Gnip APIs.

    Inspecting Data

    The principal use of jq is command-line parsing and processing of JSON data. If you consume JSON data from Gnip APIs, then you’ve already got a pile of JSON to which you can apply these techniques. For larger, product-specific use of these JSON activities, you likely have a software system already built (maybe in Java, Scala, or even Python) to efficiently parse and analyze the data. At times, however, you need to investigate JSON in a more ad hoc fashion.

    Consider investigating Twitter data that matches a new PowerTrack rule. If you collected some streaming data (or downloaded a file from a Historical request) and wrote it to disk in a file called tweets.jsonl (in the JSON Lines format), you could have a quick look at, for example, just the body of the Tweets (the text which the user wrote), with jq, like this:

    $ jq '.body' tweets.jsonl
    "RT @LutfuTurkkan: Merter Fatih kolejinin kapısında bu saatte iş makineleri, bahçeden yol geçirmek istiyorlarmış.\nZulmederek âbâd olmazsınız…"
    "Ver tu música y darte cuenta que tenes mas canciones en Ingles que en Español .-."
    "でし"
    "RT @highonyouluke: why the hell are people hating on Calum. he's an adorable little cutie pie and doesn't deserve hate. @Calum5SOS #staystr…"
    "@iwiyuki んふんふ(おはよ)"
    
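    By default, jq prints string results as JSON strings, quotes and all. If you’d rather have the raw text - say, for piping into other command-line tools - the -r option strips the JSON quoting. A minimal sketch, with a couple of inline sample activities standing in for tweets.jsonl:

```shell
# -r emits raw strings rather than JSON-quoted ones, which is handy when
# piping Tweet text on to tools like grep, sort, or wc. Two tiny inline
# activities stand in for tweets.jsonl here.
printf '%s\n' '{"body":"good morning"}' '{"body":"good night"}' \
  | jq -r '.body'
```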

    Similarly, you could quickly build CSV-like output from a handful of payload elements, such as the username, when the user joined the service, and the device they were using:

    $ jq '[.actor.preferredUsername, .actor.postedTime, .generator.displayName]' -c tweets.jsonl
    ["aknahm","2009-06-26T02:32:05.000Z","Twitter for iPhone"]
    ["CamiCastrillo","2012-03-02T00:08:39.000Z","Twitter Web Client"]
    ["_kumtecg_","2013-04-02T21:43:15.000Z","Twitter for Android"]
    ["hazza_hipss","2014-02-03T03:32:07.000Z","Twitter for Android"]
    ["nameko_nnf","2011-06-03T10:06:43.000Z","探偵協会"]
    
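    If you need genuine CSV output rather than the JSON-array approximation above, jq’s built-in @csv formatter will quote and join the fields for you; pair it with -r so the resulting row isn’t itself wrapped in a JSON string. A sketch with one inline sample activity:

```shell
# @csv turns an array into a properly quoted CSV row; -r keeps jq from
# wrapping that row in JSON string quotes. One inline sample activity.
printf '%s\n' '{"actor":{"preferredUsername":"aknahm","postedTime":"2009-06-26T02:32:05.000Z"},"generator":{"displayName":"Twitter for iPhone"}}' \
  | jq -r '[.actor.preferredUsername, .actor.postedTime, .generator.displayName] | @csv'
```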

    For quick extraction of elements from JSON, jq is easy to use and efficient.
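    jq can also filter which activities it prints at all. The built-in select() function passes along only the inputs for which a condition holds - for instance, only the bodies of Tweets posted from a particular client. A sketch with inline sample activities:

```shell
# select() keeps only the activities matching a condition (here, Tweets
# sent from the Android client) and then extracts their bodies.
printf '%s\n' \
  '{"body":"first","generator":{"displayName":"Twitter for Android"}}' \
  '{"body":"second","generator":{"displayName":"Twitter Web Client"}}' \
  | jq -r 'select(.generator.displayName == "Twitter for Android") | .body'
```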

    Debugging Invalid Rules

    Assume you created a PowerTrack rules.json file with values and tags for some topics of interest. Regardless of the whitespace formatting you choose, you can go back and forth between “pretty” formatting and compact formatting with jq command-line options:

    $ jq '.' -c rules.json
    {"rules":[{"value":"bieber OR beiber","tag":"music:artists"},{"value":"obama -iraq","tag":"politics:people"},{"value":"#ferguson -from:ferguson","tag":"events:current"},{"value":"stau lang:de","tag":"compare:traffic:danish"},{"value":"\"traffic jam\" lang:en","tag":"compare:traffic:english"}]}
    
    $ jq '.' rules.json
    {
      "rules": [
        {
          "value": "bieber OR beiber",
          "tag": "music:artists"
        },
        {
          "value": "obama -iraq",
          "tag": "politics:people"
        },
        {
          "value": "#ferguson -from:ferguson",
          "tag": "events:current"
        },
        {
          "value": "stau lang:de",
          "tag": "compare:traffic:danish"
        },
        {
          "value": "\"traffic jam\" lang:en",
          "tag": "compare:traffic:english"
        }
      ]
    }
    

    Using the PowerTrack rules APIs, you can POST your new rules to your stream and be streaming new data in no time. But when your HTTP request returns a 400 Bad Request and complains of invalid JSON, it’s time to start debugging your rules. Having the rules displayed in the second format shown above can be extremely useful for debugging. If you happen to be working with even a modest-sized rules payload, scanning through it by eye is not a good use of your time. jq can help you here, too. For example, if you forgot to escape a double quote in the "compare:traffic:english" rule, jq will let you know that there’s a problem and also point out which line to check:

    $ jq '.' rules.json
    parse error: Invalid numeric literal at line 20, column 36
    

    Note that the column number is less direct in pointing out the issue; it has to do with how jq attempts to parse the JSON. Similarly, if you forget the comma after the "politics:people" rule value, jq again points out the trouble line:

    $ jq '.' rules.json
    parse error: Expected separator between values at line 9, column 11
    

    And if you accidentally add an extra trailing comma – say after the final "lang:en" traffic rule – jq will give you a hint that it was expecting another object (that is, another rule value contained in a {...} structure):

    $ jq '.' rules.json
    parse error: Expected another array element at line 23, column 3
    

    Also note that if your rules file is in the compact format, the trouble line reported will always be “line 1,” which isn’t much help since the entire payload is on that one line. These error messages are therefore most useful on rules files in the “pretty” format.
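    This also suggests a way to validate a rules file in a script before ever POSTing it: jq empty parses its input and produces no output, so the exit status alone tells you whether the JSON is valid. A sketch, with an inline sample rules file:

```shell
# `jq empty` parses the input and prints nothing; a zero exit status means
# the file is valid JSON. Handy as a pre-flight check before POSTing rules.
printf '%s\n' '{"rules":[{"value":"from:gnip","tag":"user:gnip"}]}' > rules.json
if jq empty rules.json 2>/dev/null; then
  echo "rules.json parses cleanly"
else
  echo "rules.json is invalid JSON" >&2
fi
```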

    Building Rules From Plain Text

    There are, of course, many languages and tools that can be used to create PowerTrack rules for use with the Gnip APIs. Though jq is designed as a command-line JSON parser, it can be coerced into creating JSON output, too. For example, if you have a plain text file called rules.txt with one PowerTrack rule per line, you can create a simple, valid JSON rules payload using a couple of jq built-in functions and command-line options, like so:

    $ cat rules.txt
    dog lang:en
    cat sample:10
    from:gnip
    
    $ jq '. | split("\n") | map(select(length > 0) | {value: .}) | {rules: .}' -R -s rules.txt
    {
      "rules": [
        {
          "value": "dog lang:en"
        },
        {
          "value": "cat sample:10"
        },
        {
          "value": "from:gnip"
        }
      ]
    }
    

    That jq command might be a little cumbersome to type every time. To save yourself some effort, you can create a shell alias to make it easier:

    $ alias jqrules="jq '. | split(\"\n\") | map(select(length > 0) | {value: .}) | {rules: .}' -R -s"
    $ jqrules rules.txt
    {
      "rules": [
        {
          "value": "dog lang:en"
        },
        {
          "value": "cat sample:10"
        },
        {
          "value": "from:gnip"
        }
      ]
    }
    

    If you find yourself using that alias often, add it to your shell’s startup routine (e.g. a .bashrc file) and you’ll always be just a short command away from converting a simple, newline-delimited set of PowerTrack rules into valid JSON for use with Gnip APIs. For an example of also including tags, see the footnote at the bottom.

    While jq is only one way of manipulating JSON data, it can be a very efficient way of inspecting and working with JSON data like the Activity Streams data served from – and PowerTrack rules used in – Gnip APIs.


    Footnote

    If your text file also includes tags for each rule, separated by (for example) tabs, you can work this into the jq shortcut as well:

    $ cat rules.txt
    dog lang:en	dogs:english
    cat sample:10	cats:0.1
    from:gnip	user:gnip
    
    $ jq '. | split("\n") | map(select(length > 0) | split("\t") | {value: .[0], tag: .[1]}) | {rules: .}' -R -s rules.txt
    {
      "rules": [
        {
          "value": "dog lang:en",
          "tag": "dogs:english"
        },
        {
          "value": "cat sample:10",
          "tag": "cats:0.1"
        },
        {
          "value": "from:gnip",
          "tag": "user:gnip"
        }
      ]
    }
    

    You can just as easily use any other delimiter by replacing the tab character ("\t") with the appropriate character for your needs!
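    For instance, the same pipeline works on comma-separated input by changing only the delimiter passed to the inner split(). A sketch with inline sample rules (compact output via -c):

```shell
# Identical pipeline, but with comma-separated value/tag pairs; only the
# inner split() delimiter changes from "\t" to ",".
printf '%s\n' 'dog lang:en,dogs:english' 'cat sample:10,cats:0.1' \
  | jq -R -s -c '. | split("\n") | map(select(length > 0) | split(",") | {value: .[0], tag: .[1]}) | {rules: .}'
```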