jq reduce
I found myself having to process a large JSON file using jq --slurp, and it was using up a lot of memory, so I thought it was time to learn about jq’s reduce function.
reduce works like this:
reduce inputs as $line (INIT; REDUCER)
INIT is whatever you want; commonly you’ll use an empty object, {}, or an empty array, [].
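REDUCER is a jq transform that is given . as the current value of the accumulator, can access $line, and is expected to return the new accumulator value.
Because inputs reads only the records that jq has not consumed yet, run jq with -n (--null-input) when using it this way; without it, jq reads the first record as . before the program starts, and the reduce silently skips that record.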
A do-nothing reduce would look like this:
reduce inputs as $line
({}; {})
That is: the initial accumulator value is the empty object, and the reducer ignores the input line and returns the empty object.
The following example returns the last value passed to it, because each record replaces the accumulator:
reduce inputs as $line
({}; $line)
You can replicate --slurp with the following:
reduce inputs as $line
([]; . + [$line])
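To try these reducers without a real log file, here is a minimal sketch that feeds three made-up records through each one. The sample records and the sample function name are invented for illustration; -n keeps jq from consuming the first record before inputs sees it, and -c only makes the output compact.

```shell
# Three made-up input records, one JSON value per line (hypothetical data).
sample() { printf '%s\n' '{"n":1}' '{"n":2}' '{"n":3}'; }

# Do-nothing reducer: ignores every record and returns the empty object.
nothing=$(sample | jq -n -c 'reduce inputs as $line ({}; {})')

# Last-value reducer: each record replaces the accumulator.
last=$(sample | jq -n -c 'reduce inputs as $line ({}; $line)')

# Slurp-like reducer: appends each record to an array.
all=$(sample | jq -n -c 'reduce inputs as $line ([]; . + [$line])')

echo "$nothing"   # {}
echo "$last"      # {"n":3}
echo "$all"       # [{"n":1},{"n":2},{"n":3}]
```

Note that the slurp-like reducer still builds the whole array in memory, so it saves nothing over --slurp; the benefit of reduce comes when the accumulator is smaller than the input.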
A more complicated example
Simplified, the JSON logs that I was processing look like this:
{"timestamp": "2019-01-09T06:52:58.079Z", "user": 1, "data": { ... }}
{"timestamp": "2019-01-09T06:52:58.148Z", "user": 1, "data": { ... }}
{"timestamp": "2019-01-09T06:52:58.171Z", "user": 2, "data": { ... }}
{"timestamp": "2019-01-09T06:52:58.178Z", "user": 1, "data": { ... }}
{"timestamp": "2019-01-09T06:52:58.179Z", "user": 3, "data": { ... }}
{"timestamp": "2019-01-09T06:52:58.231Z", "user": 2, "data": { ... }}
…and I want to count the number of events, per user, per minute.
I came up with the following:
jq -n 'reduce inputs as $line
  ({};
    ($line.user | tostring) as $user
    | .[$user] as $current
    | $line.timestamp[0:16] as $bucket
    | $current[$bucket] as $count
    | { ($bucket): ($count + 1) } as $this
    | ($current + $this) as $next
    | . + { ($user): ($next) }
  )' \
  raw.json
…which results in the following:
{
"1": {
"2019-01-09T06:52": 4,
"2019-01-09T06:53": 178,
"2019-01-09T06:54": 202,
... etc.
Here’s how it works
For each record ($line) in the input, reduce, starting with an empty object. The -n flag stops jq from consuming the first record before inputs can see it:
jq -n 'reduce inputs as $line
  ({};
Set $user to the value of the user field (an integer), converted to a string. It needs converting to a string because I’m going to use it as an object key later, and object keys can’t be numbers:
($line.user | tostring) as $user
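A quick way to see why the conversion matters, using a made-up one-field object (not from the logs): indexing an object with a number is an error in jq, while the same lookup with a string key works.

```shell
# Indexing an object with a number fails, so jq exits non-zero.
if jq -n '{"1": "a"} | .[1]' >/dev/null 2>&1; then
  numeric_key="ok"
else
  numeric_key="error"
fi

# The same lookup with the key converted to a string succeeds.
string_key=$(jq -n -r '{"1": "a"} | .[1 | tostring]')

echo "$numeric_key"  # error
echo "$string_key"   # a
```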
The accumulator (.) is an object. .[$user] gets the current value of the field identified by $user:
| .[$user] as $current
In our example data, that’s equivalent to, say, acc["1"] or acc["2"]. Assign that to $current. The first time a user appears there is no such field yet, so $current is null.
Then take the timestamp field, trim it to represent just the minute, by using the range expression [0:16], and assign it to $bucket:
| $line.timestamp[0:16] as $bucket
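As a small check of the slice, using the first timestamp from the sample logs: the first 16 characters cover the date, hour, and minute, and drop the seconds and milliseconds.

```shell
# Keep characters 0 through 15 of the timestamp: date, hour and minute.
bucket=$(jq -n -r '"2019-01-09T06:52:58.079Z" | .[0:16]')

echo "$bucket"  # 2019-01-09T06:52
```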
The current record for the user is an object, keyed by bucket, where the value is the current count:
| $current[$bucket] as $count
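Create a new record with the same key, incrementing the count. If this is the first event in the bucket, $count is null, and jq evaluates null + 1 as 1, so no special case is needed. The name $this could be confusing, but I couldn’t come up with a better name; stet: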
| { ($bucket): ($count + 1) } as $this
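The increment can be tried on its own. This sketch assumes a made-up $current holding a count of 3 for one bucket, and also shows the empty case, where the count starts from null.

```shell
# Adding a number to null yields the number, so a missing count becomes 1.
first=$(jq -n 'null + 1')

# With an existing count (hypothetical value 3), the new record holds 4.
next=$(jq -n -c '
  {"2019-01-09T06:52": 3} as $current
  | "2019-01-09T06:52" as $bucket
  | $current[$bucket] as $count
  | { ($bucket): ($count + 1) }')

echo "$first"  # 1
echo "$next"   # {"2019-01-09T06:52":4}
```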
Merge it with the current record:
| ($current + $this) as $next
Merge that with the user-keyed record:
| . + { ($user): ($next) }
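Both merges rely on how + behaves for objects: keys from the right-hand side overwrite matching keys on the left, and null on either side is ignored. A minimal sketch with made-up keys a and b:

```shell
# The right-hand object wins for keys present on both sides.
merged=$(jq -n -c '{"a": 1, "b": 2} + {"b": 3}')

# null + object is just the object, which covers the first event for a user.
from_null=$(jq -n -c 'null + {"b": 3}')

echo "$merged"     # {"a":1,"b":3}
echo "$from_null"  # {"b":3}
```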
Then there’s just closing brackets and stuff:
)' \
raw.json
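To run the whole thing end to end, here is a sketch that uses the six sample records from earlier. The temporary file and the empty {} standing in for the elided data field are assumptions made for the example. It also includes a shorter variant that is not from the walkthrough above: it updates the nested path in place with +=, which should give the same result.

```shell
# The six sample records, with the elided "data" replaced by {} (an assumption).
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
{"timestamp": "2019-01-09T06:52:58.079Z", "user": 1, "data": {}}
{"timestamp": "2019-01-09T06:52:58.148Z", "user": 1, "data": {}}
{"timestamp": "2019-01-09T06:52:58.171Z", "user": 2, "data": {}}
{"timestamp": "2019-01-09T06:52:58.178Z", "user": 1, "data": {}}
{"timestamp": "2019-01-09T06:52:58.179Z", "user": 3, "data": {}}
{"timestamp": "2019-01-09T06:52:58.231Z", "user": 2, "data": {}}
EOF

# The step-by-step reducer (-n so that inputs sees every record).
long=$(jq -n -c 'reduce inputs as $line
  ({};
    ($line.user | tostring) as $user
    | .[$user] as $current
    | $line.timestamp[0:16] as $bucket
    | $current[$bucket] as $count
    | { ($bucket): ($count + 1) } as $this
    | ($current + $this) as $next
    | . + { ($user): ($next) }
  )' "$logfile")

# Alternative sketch: increment the nested count directly with +=.
short=$(jq -n -c 'reduce inputs as $line
  ({};
    ($line.user | tostring) as $user
    | $line.timestamp[0:16] as $bucket
    | .[$user][$bucket] += 1
  )' "$logfile")

echo "$long"
echo "$short"
rm -f "$logfile"
```

On this sample both commands print counts of 3, 2 and 1 for users 1, 2 and 3, all in the 2019-01-09T06:52 bucket.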