    Converting Data from JSON to CSV

    posted on 30 November 2015 by Jim Moffitt

    Customers often ask us about converting Tweet JSON into comma-separated values (CSV). These customers have received Twitter data from a Gnip product such as Historical PowerTrack, 30-Day Search, or Full-Archive Search, all of which encode Tweets in JSON. The reasons for converting are numerous, from wanting to work with a sample of the data in a spreadsheet to needing to import the data into a relational database or legacy system. The CSV format’s value is in its simplicity, and most software systems are able to import it.

    To help with this process, the json2csv app is available. This tool converts Gnip Activity Stream (AS) JSON to the comma-separated values (CSV) format.

    Tweet attributes of interest are indicated by referencing a Tweet Template of your choice. If the Template contains an attribute, it is written to the output CSV files; if it does not, the attribute is dropped and not written. You can design your own Tweet Template, or use one of the provided example Templates.

    This tool works with an input folder and attempts to convert all *.json and *.json.gz files it finds there, writing the resulting CSV files to an output folder. It works with Tweet JSON produced by both the Gnip Historical PowerTrack and Search API products, and was designed to convert JSON Tweets in bulk.
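
    For illustration, here is a minimal Ruby sketch of that bulk file-handling pattern (this is not the json2csv source; the ‘input’ folder name is hypothetical):

    require 'zlib'

    # Collect the raw JSON text of every input file, decompressing .gz files as needed.
    Dir.glob('input/*.{json,json.gz}').each do |path|
      raw = path.end_with?('.gz') ? Zlib::GzipReader.open(path, &:read) : File.read(path)
      puts "#{path}: read #{raw.length} characters of JSON"
    end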

    Before deciding to perform this type of conversion, you should consider the following trade-offs:

    1. JSON data are multi-dimensional, with multiple levels of nested data, while CSVs are two-dimensional. Converting from JSON to CSV means sacrificing detail and flexibility in the data, either by flattening it or by discarding some fields.
    2. If you are consuming the data into a custom app, retaining the data in JSON provides a level of flexibility not available with CSVs. For example, field order is not important in JSON, while column order in CSVs is very important (and therefore, arguably more fragile).
    3. It is recommended that you save the source JSON, since it represents all the available metadata. You may decide to go back and convert metadata you did not include the first time.

    Getting started

    To set up the tool (installing some common gems and running Ruby code), head over to the json2csv GitHub repository and its README.

    To further help you get started, here are some common questions about the JSON to CSV conversion process:

    I need the JSON data converted to CSV. What do I need to do?

    First, follow these instructions to get the json2csv tool set up in your environment. The tool was developed in Ruby, so you need a Ruby interpreter.

    Second, you need a Tweet Template which identifies all the Tweet attributes that you are interested in. The conversion process uses this template and creates a CSV file with a column for every attribute in the Template. The conversion process represents an opportunity to ‘tune’ what you want to export. For example, the standard Twitter metadata includes the numeric character position of hashtags in a Tweet message. You may decide that you do not need this information, and therefore can omit those details from your Tweet Template.
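
    As a rough illustration of the idea (a sketch only, not the tool’s actual code; the method and file names are hypothetical), a Tweet Template can be thought of as a filter applied to each Tweet hash:

    require 'json'

    # Keep only the attributes that also appear in the template, recursing into
    # nested objects. Arrays are kept whole here for brevity.
    def filter_by_template(tweet, template)
      tweet.each_with_object({}) do |(key, value), kept|
        next unless template.key?(key)
        if value.is_a?(Hash) && template[key].is_a?(Hash)
          kept[key] = filter_by_template(value, template[key])
        else
          kept[key] = value
        end
      end
    end

    template = JSON.parse(File.read('tweet_template.json'))
    tweet    = JSON.parse(File.read('tweet.json'))
    puts JSON.generate(filter_by_template(tweet, template))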

    How are CSV column names determined?

    CSV column names are generated by referencing the JSON object names and using dot notation to indicate levels and arrays of attributes. For example, here are some JSON attributes:

    {
        "id": "tag:search.twitter.com,2005:418130988250570752",
        "actor": {
            "id": "id:twitter.com:17200003"
        },
        "twitter_entities": {
            "hashtags": [
                {
                    "text": "HappyNewYear",
                }
            ]
        }
    }
    

    Using dot notation, these attributes would be represented with the following header:

    id, actor.id, twitter_entities.hashtags.0.text
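
    To make the naming rule concrete, here is a small Ruby sketch (illustrative only; the flatten_names function is hypothetical) that generates these dot-notation column names from parsed JSON:

    require 'json'

    # Walk the parsed JSON: hashes add a name level, array elements add an index.
    def flatten_names(node, prefix = '')
      case node
      when Hash
        node.flat_map { |key, value| flatten_names(value, prefix.empty? ? key : "#{prefix}.#{key}") }
      when Array
        node.each_with_index.flat_map { |value, i| flatten_names(value, "#{prefix}.#{i}") }
      else
        [prefix]
      end
    end

    tweet = JSON.parse(<<~JSON)
      {
        "id": "tag:search.twitter.com,2005:418130988250570752",
        "actor": { "id": "id:twitter.com:17200003" },
        "twitter_entities": { "hashtags": [ { "text": "HappyNewYear" } ] }
      }
    JSON

    puts flatten_names(tweet).join(', ')
    # => id, actor.id, twitter_entities.hashtags.0.text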
    

    What if I want special names for certain columns?

    With dot notation and the many levels of JSON attributes, the auto-generated names can become quite long. Therefore, you may want to override these defaults with shortened versions. The conversion process has some ‘re-mappings’ built in, and others can be added as needed. Here are some examples:

    twitter_entities.hashtags.0.text               --> hashtags
    twitter_entities.urls.0.expanded_url           --> twitter_expanded_urls
    twitter_entities.user_mentions.0.screen_name   --> user_mention_screen_names
    gnip.matching_rules.0.value                    --> rule_values
    

    Since the header for hashtags is automatically overridden, the resulting header for the above Tweet JSON is:

    id, actor.id, hashtags
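
    One way to picture the override step (a sketch that assumes the mappings are a simple lookup table; the REMAP constant and header_for method are hypothetical):

    # Map auto-generated dot-notation names to the shortened versions shown above.
    REMAP = {
      'twitter_entities.hashtags.0.text'             => 'hashtags',
      'twitter_entities.urls.0.expanded_url'         => 'twitter_expanded_urls',
      'twitter_entities.user_mentions.0.screen_name' => 'user_mention_screen_names',
      'gnip.matching_rules.0.value'                  => 'rule_values'
    }.freeze

    def header_for(names)
      names.map { |name| REMAP.fetch(name, name) }.join(',')
    end

    puts header_for(['id', 'actor.id', 'twitter_entities.hashtags.0.text'])
    # => id,actor.id,hashtags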
    
    Is there any special parsing of JSON values?

    Yes, Tweet and Actor IDs are handled specially. For these IDs, the string component is stripped off and only the numeric part is retained.

    For example, this JSON…

    {
      "id": "tag:search.twitter.com,2005:418130988250570752",
      "actor": {
        "id": "id:twitter.com:17200003",
      }
    }
    

    … would become this CSV:

    id, actor.id
    418130988250570752,17200003
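
    The stripping amounts to keeping everything after the last colon. A one-method Ruby sketch (the numeric_id method is hypothetical):

    # Keep only the trailing numeric portion of an Activity Stream ID string.
    def numeric_id(id_string)
      id_string.split(':').last
    end

    puts numeric_id('tag:search.twitter.com,2005:418130988250570752')  # => 418130988250570752
    puts numeric_id('id:twitter.com:17200003')                         # => 17200003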
    

    How are arrays of metadata handled?

    Twitter metadata include many attributes that have a variable-length list of values. Examples include hashtags, user mentions, and URLs. These example metadata are stored in a JSON object named ‘twitter_entities’, and each attribute is an array of length zero or more. Arrays in CSV are stored with double quotes around a comma-delimited list of values, so each array can be thought of as a list within another list.

    Using the hashtag example, multiple hashtags would be inserted into a single column with the ‘hashtags’ column header (relying on the built-in header override discussed above):

    id,actor.id,hashtags
    418130988250570752,17200003,"HashTag1,HashTag2"
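
    Ruby’s standard CSV library applies the double-quoting automatically when a field contains a comma, so a sketch of this behavior is short (illustrative only):

    require 'csv'

    # Join the hashtag array with commas; to_csv then quotes the combined field.
    hashtags = ['HashTag1', 'HashTag2']
    puts ['id', 'actor.id', 'hashtags'].to_csv
    puts ['418130988250570752', '17200003', hashtags.join(',')].to_csv
    # => id,actor.id,hashtags
    # => 418130988250570752,17200003,"HashTag1,HashTag2"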
    

    How long will the conversion process take?

    It depends on how many files are being processed, how many Tweets are being converted, and how many attributes are included in the Tweet Template. If there are 10 million Tweets, and 200 Tweet attributes in the Template, there are 2 billion attributes to process.

    Using a standard Tweet Template, approximately 5 million Tweets can be processed per hour, so large datasets can take hours to process. At that rate, the 10-million-Tweet example above would take roughly two hours.

    Tweet Templates 

    A Tweet Template is an example Tweet payload in JSON that contains all the fields you want to export to the CSV files. Social activities, such as Tweets, are dynamic in nature, and the payloads from one Tweet to another are sure to be different. One could be a geo-tagged Tweet with several hashtags and mentions, while the next is a Retweet with a URL.

    This example Tweet is referred to as the conversion ‘Tweet Template.’ The conversion process loads this template, then walks through each of your historical Tweets and exports all metadata that is specified in the Template. Here is a short example Template that would export the bare minimum of metadata:

    {
      "id": "tag:search.twitter.com,2005:418130988250570752",
      "actor": {
        "id": "id:twitter.com:17200003",
        "preferredUsername": "jimmoffitt"
      },
      "verb": "post",
      "postedTime": "2013-12-31T21:26:10.000Z",
      "body": "Example Tweet #HashTag1 #HashTag2",
      "twitter_entities": {
        "hashtags": [
          {"text": "HashTag1"},
          {"text": "HashTag2"}
        ]
      }
    }
    

    This would be represented in a CSV file as:

    id,actor.id,preferredUsername,verb,postedTime,body,hashtags
    418130988250570752,17200003,jimmoffitt,post,2013-12-31T21:26:10.000Z,Example Tweet #HashTag1 #HashTag2,"HashTag1,HashTag2"
    

    A couple things to note about this JSON to CSV conversion:

    • Tweet and User IDs have been stripped down to just the numeric content.
    • Dot notation is used to preserve hierarchy when needed. In this case it was used to handle the repeated use of ‘id’.
    • Dot notation names can be overridden (such as hashtags in this example).
    • Arrays are stored as comma-separated values inside double quotes.

    Example Tweet Templates

    It can be difficult and time-consuming to find just the perfect Tweet ‘in the wild’, an actual Tweet that encapsulates all the metadata you care about. So, you may need to ‘hand build’ your own Tweet Template. This means assembling a JSON object by picking and choosing the fields you want and copying them into a JSON file. When doing this, keep the following details in mind:

    • Tweet Template JSON must be valid for the conversion code to work. If the conversion code cannot parse the JSON Template, it will exit. There are many on-line validators to confirm your JSON is formatted correctly, or you can check it from Ruby as sketched after this list.
    • Order of objects does not absolutely matter. You could have the actor object below the Twitter Entities object. However, the order will affect the order of the CSV columns in the output.
    • Array attributes only need an array length of one. The conversion process knows to export all array elements it finds.
    • Hierarchy matters. If you skip or add a level in the Template, that ‘pattern’ will not be found in the processed Tweets. For example, gnip.matching_rules.0.value has a numeric token, indicating an array, while gnip.matching_rules.value does not. Since the matching_rules attribute is an array, gnip.matching_rules.0.value will match, while the other will not.
    • Metadata values do not have to be internally consistent, since the values of the JSON name/value pairs do not matter; all that matters are the JSON names. In the Tweet Template examples below you will see inconsistencies. For example, the geographic metadata can be inconsistent, with an actor location in one place and the Gnip Profile Geo in another.
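
    For the validity check mentioned in the first bullet above, a quick Ruby alternative to an on-line validator (the file name is just an example):

    require 'json'

    # Attempt to parse the hand-built Template; report the parser error if it fails.
    begin
      JSON.parse(File.read('my_tweet_template.json'))
      puts 'Template is valid JSON.'
    rescue JSON::ParserError => e
      puts "Template is not valid JSON: #{e.message}"
    end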

    Here are several pre-built examples:

    • ‘Standard’ Tweet Template (tweet_standard.json): Handles both original Tweets and Retweets. No Twitter geo metadata; all Twitter entities included with select attributes (i.e., no hashtag indices); includes standard Gnip enrichments (matching rules, urls, language). Retweets are indicated by verb, original Tweet id, and author name/id.
    • ‘Tweet IDs’ Tweet Template (tweet_ids.json): For selecting just the numeric Tweet IDs.
    • ‘User IDs’ Tweet Template (user_ids.json): For selecting just the numeric User IDs.
    • ‘Small’ Tweet Template (tweet_small.json): For selecting just the basics.
    • ‘Everything’ Retweet Template (tweet_everything.json): Includes complete data, including the full Retweet and nested Tweet. Includes all Twitter entities and all attributes (like hashtag indices), Twitter geo metadata, and all Gnip enrichments.
    • ‘Standard + Geo’ Tweet Template (tweet_standard_geo.json): Same as the ‘Standard’ template, but also includes Twitter geo metadata.
    • ‘Profile Geo’ Tweet Template (tweet_profile_geo.json): Same as ‘Standard Geo’ Template, with the addition of the Profile Geo enrichment.
    • ‘All gnip enrichments’ Tweet Template (tweet_all_enrichments.json): Same as ‘Profile Geo’ Template, with the addition of Klout Topics data.

    Tags


    twitter, CSV, tools