Translating Plain Language to PowerTrack Rules
posted on 17 April 2014 by Leah Barren
Using Gnip PowerTrack rules to filter social data firehoses is key to ingesting just the right data for your platform. Whether your app can generate meaningful analysis or insights depends on receiving the exact data it needs, and PowerTrack rules are the lynchpin in that equation. PowerTrack provides extensive choices for defining the content in your stream – options include simple keyword matching, user attributes, user location, language, and many more, all of which can be combined using complex boolean logic.
When defining rules that deliver the data you want, one of the biggest challenges is to take requirements as they’re expressed in plain language, and translate them precisely to the operators and filtering syntax. In this article, we will take rules that are articulated in English, and transform them into PowerTrack rules using the appropriate Gnip Operators and syntax.
Let’s begin with a simple theoretical example.
“I want Tweets discussing the Tesla Model S.”
The easiest place to start is to use the simple keyword ‘tesla’. However, that is only the starting point – it will certainly get us all Tweets that mention ‘tesla,’ but it’s not really targeted. It would also match Tweets where people are discussing Tesla coils, or Nikola Tesla and his rivalry with Thomas Edison, rather than the car. There are two primary ways to target the conversation more precisely: 1) add positive keyword requirements, and 2) specify keywords to exclude.
Adding positive keywords generally provides a more significant improvement in the level of accuracy, so we’ll begin there. Here, this is fairly simple – we want to restrict our ‘tesla’ matches to those where the exact phrase ‘model s’ also appears. The PowerTrack rule to represent this looks like the following:
tesla "model s"
The double-quotes around the phrase ‘model s’ turn this into an exact phrase match – Tweets will only match where that exact phrase appears in that order within the text.
However, maybe you’ve observed that users also refer to this using ‘model-s’, rather than ‘model s’. If you want to also account for this scenario, you could use the following PowerTrack rule:
tesla ("model s" OR "model-s")
Here, we’ve employed grouping and boolean OR logic. To translate the rule above, Tweets would match where they contain ‘tesla’, and where they contain either ‘model s’ or ‘model-s’.
This concept can be expanded on further – imagine you only want Tweets like this which use emotionally-charged keywords like ‘angry’, ‘happy’, or ‘love’. These could be included as an additional group, using “and any of these” logic. For example:
tesla ("model s" OR "model-s") (angry OR happy OR love)
We now have a rule that is significantly more targeted than the original ‘tesla’ rule. However, you may want greater confidence that you will exclude specific types of references, if they somehow manage to meet the requirements you’ve already specified.
To do this, you simply exclude the types of mentions you’ve identified as undesirable by placing a - character in front of them. For example:
tesla ("model s" OR "model-s") (angry OR happy OR love) -nikola -edison -coil -coils
Or, this can be written more simply using grouping and OR logic:
tesla ("model s" OR "model-s") (angry OR happy OR love) -(nikola OR edison OR coil OR coils)
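The two forms are equivalent by De Morgan's law: excluding each term separately is the same as excluding the whole OR group. As a minimal sketch, the Python below uses plain booleans as stand-ins for "the Tweet contains this term" and checks every combination. The function names are hypothetical, and this only illustrates the logic, not how PowerTrack evaluates rules internally.

```python
from itertools import product

def separate_negations(nikola, edison, coil, coils):
    # Models: -nikola -edison -coil -coils
    return (not nikola) and (not edison) and (not coil) and (not coils)

def grouped_negation(nikola, edison, coil, coils):
    # Models: -(nikola OR edison OR coil OR coils)
    return not (nikola or edison or coil or coils)

# Try all 16 combinations of "term present" / "term absent".
all_equal = all(
    separate_negations(*flags) == grouped_negation(*flags)
    for flags in product([False, True], repeat=4)
)
print(all_equal)
```

Both functions agree for every combination, which is why the grouped form can be used as a shorter way of writing the same exclusion.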
We now have a highly targeted rule that will deliver just the Tweets with the types of text phrases we want. To recap, here is a plain-language statement of what our rule will deliver. Tweets which:
1. Include ‘tesla’ in the text,
2. And which also have either ‘model s’ or ‘model-s’ in the text,
3. And which also have ‘angry’ or ‘happy’ or ‘love’ in the text,
4. And which DO NOT have ‘nikola’, ‘edison’, ‘coil’, or ‘coils’ in the text.
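If your requirements live in lists of keywords, the four steps above can be assembled mechanically. The following is a rough Python sketch, not part of any Gnip library: the helper names (quote_if_phrase, any_of, build_rule) are hypothetical, and quoting any term that contains a space or hyphen is simply a convention that follows the examples in this article.

```python
def quote_if_phrase(term):
    # Wrap multi-word or hyphenated terms in double quotes so they are
    # written as exact phrases, as in the article's "model s" / "model-s".
    needs_quotes = " " in term or "-" in term
    return f'"{term}"' if needs_quotes else term

def any_of(terms):
    # Parenthesised OR group: (a OR b OR c)
    return "(" + " OR ".join(quote_if_phrase(t) for t in terms) + ")"

def build_rule(required, any_groups, excluded):
    parts = [quote_if_phrase(t) for t in required]
    parts += [any_of(group) for group in any_groups]
    if excluded:
        # A single negated group: -(a OR b OR c)
        parts.append("-" + any_of(excluded))
    # Whitespace between clauses is read as AND.
    return " ".join(parts)

rule = build_rule(
    required=["tesla"],
    any_groups=[["model s", "model-s"], ["angry", "happy", "love"]],
    excluded=["nikola", "edison", "coil", "coils"],
)
print(rule)
```

Called with these lists, the sketch reproduces the final rule from this example.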
Now let’s look at an example that incorporates filtering on some of the unique aspects of social data, outside of just plain text matching.
“Our company is interested in some references about the TV show “Cutthroat Kitchen” on the Food Network, as well as references to the host (Alton Brown) that relate to the show.”
Using what we learned in the first example, we can create a rule to capture mentions like this.
"cutthroat kitchen" OR cutthroatkitchen OR (("alton brown" OR altonbrown) (host OR show OR mean OR diabolical))
This rule would capture:
1. Tweets with the phrase “cutthroat kitchen” in the text
2. Tweets with ‘cutthroatkitchen’ in the text (same as above, but without the space)
3. Tweets mentioning ‘alton brown’ (or ‘altonbrown’ without the space), where that Tweet also mentions ‘host’, ‘show’, ‘mean’, or ‘diabolical’.
Now, let’s imagine that the company collects Tweets with this rule for a week, but their customer is unhappy with the quality of the results, and wants to target content more specifically. This time around, they want to narrow the results, but also want to capture some specific mentions they missed before.
“We only want Tweets using our promoted hashtag (#cutthroatkitchen), Tweets mentioning the show’s host by his Twitter handle (@altonbrown) in a way that relates to the show, or Tweets that link to the Food Network’s online page about Alton Brown. Additionally, we only want Tweets that come from users who say they are based in the United States.”
Let’s begin with the first requirement.
Our previous rule would have captured Tweets using the promoted hashtag, thanks to our ‘cutthroatkitchen’ term. This would have matched mentions like ‘#cutthroatkitchen’, ‘@cutthroatkitchen’, or just a bare reference to ‘cutthroatkitchen’. These matches are due to the use of a tokenized match. However, our customer wants to restrict this to only match on uses of ‘#cutthroatkitchen’. To do this, we’ll use the # operator from PowerTrack as follows:
#cutthroatkitchen
While the original ‘cutthroatkitchen’ term looked for matches in the general text of the Tweet, this rule actually changes the strategy, and looks for a match in the list of hashtags Twitter has extracted from the Tweet itself (in the twitter_entities.hashtags field). Thus, it provides a much more targeted way for the customer to ensure they are only getting hashtag mentions of the phrase.
A similar concept applies to restricting the mentions of Alton Brown to those using his Twitter handle. The previous term (‘altonbrown’) would have gotten all mentions in the text of that specific string. However, we can use the @ operator in PowerTrack to restrict it to ONLY explicit references to his Twitter handle.
@altonbrown
This means PowerTrack will look in Twitter’s extracted user mentions for a match, rather than the general text used in the Tweet.
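To make the difference between a text match and an entity match concrete, here is a small Python sketch that checks the extracted hashtags and user mentions of a Tweet payload. It is only an illustration of the idea, not PowerTrack's implementation. The sample payloads are invented, the helper names are hypothetical, and the field names beyond twitter_entities.hashtags (body, user_mentions, text, screen_name) and the case-insensitive comparison are assumptions.

```python
def has_hashtag(tweet, tag):
    # Look only at the hashtags extracted from the Tweet, not the raw text.
    hashtags = tweet.get("twitter_entities", {}).get("hashtags", [])
    return any(h.get("text", "").lower() == tag.lower() for h in hashtags)

def has_mention(tweet, handle):
    # Look only at the extracted @mentions, not the raw text.
    mentions = tweet.get("twitter_entities", {}).get("user_mentions", [])
    return any(m.get("screen_name", "").lower() == handle.lower() for m in mentions)

# Invented sample: the bare word appears in the text, but no hashtag was used.
bare_word = {
    "body": "cutthroatkitchen was great tonight",
    "twitter_entities": {"hashtags": [], "user_mentions": []},
}
# Invented sample: the hashtag and the handle are both used explicitly.
tagged = {
    "body": "Loved #CutthroatKitchen tonight, @altonbrown is diabolical",
    "twitter_entities": {
        "hashtags": [{"text": "CutthroatKitchen"}],
        "user_mentions": [{"screen_name": "altonbrown"}],
    },
}

print(has_hashtag(bare_word, "cutthroatkitchen"))
print(has_hashtag(tagged, "cutthroatkitchen"))
print(has_mention(tagged, "altonbrown"))
```

The first Tweet contains the keyword in its text but would not satisfy a hashtag-only check, while the second satisfies both the hashtag and the mention check.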
Next, the customer was previously getting some Tweets linking to their web page about Alton Brown just by looking for mentions of his name within the text, including any URLs included in the text of the Tweet. However, they were missing some references where Twitter users shortened their URLs with services like bit.ly before posting them. To accommodate this need, we need to use the url_contains: operator with the specific URL the customer wants to track.
url_contains:"foodnetwork.com/chefs/alton-brown"
The url_contains: operator tells PowerTrack to look for matches in the fully expanded URL that is provided in the Tweet as an enrichment by Gnip. In other words, even if a URL is wrapped in a ‘bit.ly’ or other shortened link, Gnip will unwind it down to the final URL and allow you to look for matches there.
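The effect of matching on the expanded URL can be sketched in a few lines of Python. This is an illustration under assumptions: the payload layout (a gnip.urls list with url and expanded_url fields) is assumed rather than taken from this article, the helper name is hypothetical, and both URLs in the sample are made up.

```python
def url_contains(tweet, fragment):
    # Check the expanded URLs supplied by the enrichment, not the raw text.
    urls = tweet.get("gnip", {}).get("urls", [])
    return any(
        fragment.lower() in (u.get("expanded_url") or "").lower()
        for u in urls
    )

# Invented sample: the text only carries a shortened link.
tweet = {
    "body": "Great profile of the host http://bit.ly/example",
    "gnip": {
        "urls": [
            {
                "url": "http://bit.ly/example",
                "expanded_url": "http://www.foodnetwork.com/chefs/alton-brown.html",
            }
        ]
    },
}

fragment = "foodnetwork.com/chefs/alton-brown"
found_in_text = fragment in tweet["body"]
found_in_expanded = url_contains(tweet, fragment)
print(found_in_text, found_in_expanded)
```

A plain text search misses this Tweet because only the shortened link appears in the body, while the check against the expanded URL finds it.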
Last, the customer wants to restrict the results to Tweets where the user is from the United States. To do this, we will use Gnip’s Profile Geo enrichment, and the corresponding PowerTrack operator, to apply the restriction to all of the previously defined terms.
profile_country_code:us
Incorporating the changes above, we can come up with a rule that will satisfy the customer:
profile_country_code:us (#cutthroatkitchen OR (@altonbrown (host OR show OR mean OR diabolical)) OR url_contains:"foodnetwork.com/chefs/alton-brown")
This would give us:
- Tweets using the company’s promoted hashtag, but not those using the keyword without the hashtag
- Mentions of @altonbrown that also mention ‘host’, ‘show’, ‘mean’, or ‘diabolical’, but excluding plain-text mentions that don’t use @ mention syntax.
- Tweets that include links to the Food Network’s page about Alton Brown, even where they are shortened using bit.ly or another service.
- Additionally, no Tweets meeting the requirements above will be delivered unless they also have a profile country code for the United States, based on Gnip’s Profile Geo enrichment.
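For readers who manage rules programmatically, the rule above can be assembled from its clauses and wrapped in a JSON body. Treat the following Python as a sketch only: the payload shape (a "rules" list of objects with "value" and "tag") is an assumption about the rules API request format and should be checked against the PowerTrack documentation, and the tag "cutthroat-kitchen-us" is a hypothetical label.

```python
import json

# The geo restriction that must apply to every alternative.
geo = "profile_country_code:us"

# The three alternatives the customer asked for.
clauses = [
    "#cutthroatkitchen",
    "(@altonbrown (host OR show OR mean OR diabolical))",
    'url_contains:"foodnetwork.com/chefs/alton-brown"',
]

# Wrap the OR alternatives in one group so the geo term applies to all of them.
rule = geo + " (" + " OR ".join(clauses) + ")"

# Assumed request body shape; the tag is a made-up label for this rule.
payload = json.dumps({"rules": [{"value": rule, "tag": "cutthroat-kitchen-us"}]})

print(rule)
print(payload)
```

Keeping the clauses in a list makes it easy to add or remove an alternative without disturbing the outer parentheses.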
The syntax used is important – the use of parentheses where appropriate creates the boolean logic we want, and ensures that the ‘profile_country_code:us’ term is applied across the board. When in doubt, use parentheses to be sure you don’t end up with unexpected results due to the order of operations for PowerTrack rules.
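To see why the parentheses matter, consider the @altonbrown clause with and without its inner group. The Python sketch below uses plain booleans as stand-ins for "the Tweet matches this term", and relies on Python's `and` binding more tightly than `or`. It assumes PowerTrack likewise applies AND before OR; confirm this against the order-of-operations section of the PowerTrack documentation. The function names are hypothetical.

```python
def ungrouped(handle, host, show, mean, diabolical):
    # Models: @altonbrown host OR show OR mean OR diabolical
    # AND binds first, so this reads as (handle AND host) OR show OR ...
    return handle and host or show or mean or diabolical

def grouped(handle, host, show, mean, diabolical):
    # Models: @altonbrown (host OR show OR mean OR diabolical)
    return handle and (host or show or mean or diabolical)

# A Tweet that says "show" but never mentions @altonbrown.
flags = dict(handle=False, host=False, show=True, mean=False, diabolical=False)

print(ungrouped(**flags))
print(grouped(**flags))
```

Without the inner parentheses, any Tweet containing ‘show’ would match on its own; with them, the handle is required in every case.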
Beyond these examples, there are hundreds of ways that you can combine operators and keywords to return the data that is critical to your analysis. You can expand these concepts to narrow your search based on profile information, follower count, Tweet location, language used in the text, and many more. In addition to the topics discussed here, you should be well-versed in the full documentation around PowerTrack rules, including the limits around restricted characters and rule size.
Keep an eye out for more help articles that will detail best practices for various operators and how to use them in conjunction with Gnip’s Enrichments to return more of the data that you need.