jq reduce
I found myself having to process a large JSON file using jq --slurp, and it was using up a lot of memory, so I thought it was time to learn about jq’s reduce function.
reduce works like this (run jq with -n, or --null-input, so that inputs yields every record, including the first):
reduce inputs as $line (INIT; REDUCER)
INIT is whatever you want; commonly you’ll use an empty object, {} or an empty array, [].
REDUCER is a jq transform that is given . as the current value of the accumulator, can access $line, and is expected to return the new accumulator value.
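As a minimal sketch of those two pieces, the following sums a hypothetical field n over three made-up records. The field name and the records are assumptions for illustration only. Note the -n (--null-input) flag: without it, jq reads the first record itself and inputs only sees the rest.

```shell
# Three made-up records, one JSON object per line.
# INIT is 0; the reducer adds each record's n to the accumulator (.).
total=$(printf '%s\n' '{"n": 1}' '{"n": 2}' '{"n": 3}' \
  | jq -n 'reduce inputs as $line (0; . + $line.n)')
echo "$total"   # prints 6
```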
A do-nothing reduce would look like this:
reduce inputs as $line
({}; {})
That is: the initial accumulator value is the empty object, and the reducer ignores the input line and returns the empty object.
The following example ignores the accumulator and returns the last value passed to it:
reduce inputs as $line
({}; $line)
You can replicate --slurp with the following:
reduce inputs as $line
([]; . + [$line])
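A quick way to check that equivalence is to run both forms over the same input and compare. The two records here are made up for illustration. Like --slurp, this reducer still collects every record into one array, so it is a starting point rather than a memory saving.

```shell
# Two made-up records, one JSON object per line.
records='{"n": 1}
{"n": 2}'

# -c prints compact output so the two results are easy to compare.
slurped=$(printf '%s\n' "$records" | jq -c --slurp '.')
reduced=$(printf '%s\n' "$records" \
  | jq -c -n 'reduce inputs as $line ([]; . + [$line])')

echo "$slurped"   # prints [{"n":1},{"n":2}]
echo "$reduced"   # prints [{"n":1},{"n":2}]
```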
A more complicated example
Simplified, the JSON logs that I was processing look like this:
{"timestamp": "2019-01-09T06:52:58.079Z", "user": 1, "data": { ... }}
{"timestamp": "2019-01-09T06:52:58.148Z", "user": 1, "data": { ... }}
{"timestamp": "2019-01-09T06:52:58.171Z", "user": 2, "data": { ... }}
{"timestamp": "2019-01-09T06:52:58.178Z", "user": 1, "data": { ... }}
{"timestamp": "2019-01-09T06:52:58.179Z", "user": 3, "data": { ... }}
{"timestamp": "2019-01-09T06:52:58.231Z", "user": 2, "data": { ... }}
…and I want to count the number of events, per user, per minute.
I came up with the following (the -n flag stops jq from consuming the first record itself before inputs gets to see it):
jq -n 'reduce inputs as $line
  ({};
    ($line.user | tostring) as $user
    | .[$user] as $current
    | $line.timestamp[0:16] as $bucket
    | $current[$bucket] as $count
    | { ($bucket): ($count + 1) } as $this
    | ($current + $this) as $next
    | . + { ($user): ($next) }
  )' \
  raw.json
…which results in the following:
{
  "1": {
    "2019-01-09T06:52": 4,
    "2019-01-09T06:53": 178,
    "2019-01-09T06:54": 202,
    ... etc.
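To try the program without the real logs, here is a sketch that feeds it the six sample records from above. The elided data field is simply left out, since the reducer never reads it. The -c and -S flags (compact output, sorted keys) are only there to make the result easy to compare.

```shell
# The six sample records, without the elided "data" field.
records='{"timestamp": "2019-01-09T06:52:58.079Z", "user": 1}
{"timestamp": "2019-01-09T06:52:58.148Z", "user": 1}
{"timestamp": "2019-01-09T06:52:58.171Z", "user": 2}
{"timestamp": "2019-01-09T06:52:58.178Z", "user": 1}
{"timestamp": "2019-01-09T06:52:58.179Z", "user": 3}
{"timestamp": "2019-01-09T06:52:58.231Z", "user": 2}'

# -n makes `inputs` see every record, including the first.
counts=$(printf '%s\n' "$records" | jq -n -c -S 'reduce inputs as $line
  ({};
    ($line.user | tostring) as $user
    | .[$user] as $current
    | $line.timestamp[0:16] as $bucket
    | $current[$bucket] as $count
    | { ($bucket): ($count + 1) } as $this
    | ($current + $this) as $next
    | . + { ($user): ($next) }
  )')

echo "$counts"
# prints {"1":{"2019-01-09T06:52":3},"2":{"2019-01-09T06:52":2},"3":{"2019-01-09T06:52":1}}
```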
Here’s how it works
For each record ($line) in the input, reduce, starting with an empty object:
jq -n 'reduce inputs as $line
({};
Set $user to the value of the user field (an integer), converted to a string. It needs converting to a string because I’m going to use it as an object key later, and object keys must be strings; indexing an object with a number is an error:
($line.user | tostring) as $user
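The difference is easy to see on a tiny made-up object (the key and value here are illustrative only):

```shell
# Indexing an object with a number is an error, so jq exits non-zero...
if ! jq -n '{"1": "a"} | .[1]' 2>/dev/null; then
  echo "numeric index rejected"
fi

# ...while the same number converted with tostring works as a key.
value=$(jq -n -r '{"1": "a"} | .[1 | tostring]')
echo "$value"   # prints a
```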
The accumulator (.) is an object. .[$user] gets the current value of the field identified by $user:
| .[$user] as $current
In our example data, that’s equivalent to, say, acc["1"] or acc["2"]. Assign that to $current.
Then take the timestamp field, trim it to represent just the minute, by using the range expression [0:16], and assign it to $bucket:
| $line.timestamp[0:16] as $bucket
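On its own, the slice keeps the first 16 characters, which is everything up to and including the minute. A small check using the first sample timestamp:

```shell
# -r prints the raw string without surrounding quotes.
bucket=$(jq -n -r '"2019-01-09T06:52:58.079Z" | .[0:16]')
echo "$bucket"   # prints 2019-01-09T06:52
```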
The current record for the user is an object, keyed by bucket, where the value is the current count. The first time a user or bucket is seen, $current or $count is simply null:
| $current[$bucket] as $count
Create a new record with the same key, incrementing the count. The name $this could be confusing, but I couldn’t come up with a better name; stet:
| { ($bucket): ($count + 1) } as $this
Merge it with the current record:
| ($current + $this) as $next
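Neither step needs any special handling for a user or bucket that has not been seen yet, because jq treats null leniently: indexing null gives null, and adding null to a value returns that value unchanged. A small sketch of the first-event case, with $current forced to null for illustration:

```shell
# Simulate the first event for a user: there is no current record yet.
first_seen=$(jq -n -c 'null as $current
  | $current["2019-01-09T06:52"] as $count
  | [$count, $count + 1, $current + {"2019-01-09T06:52": 1}]')

echo "$first_seen"   # prints [null,1,{"2019-01-09T06:52":1}]
```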
Merge that with the user-keyed record:
| . + { ($user): ($next) }
Then there’s just closing brackets and stuff:
)' \
raw.json
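For comparison, the lookup, increment and merge steps can probably be collapsed into a single update-assignment on a nested path. This is not the version described above, just a sketch of an alternative that should produce the same counts; the three records are a made-up subset of the sample data.

```shell
# Three sample-style records: user 1 twice, user 2 once.
records='{"timestamp": "2019-01-09T06:52:58.079Z", "user": 1}
{"timestamp": "2019-01-09T06:52:58.148Z", "user": 1}
{"timestamp": "2019-01-09T06:52:58.171Z", "user": 2}'

# .[user][bucket] += 1 creates missing objects along the path and
# treats a missing count as null, and null + 1 is 1.
compact=$(printf '%s\n' "$records" | jq -n -c -S 'reduce inputs as $line
  ({}; .[$line.user | tostring][$line.timestamp[0:16]] += 1)')

echo "$compact"
# prints {"1":{"2019-01-09T06:52":2},"2":{"2019-01-09T06:52":1}}
```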