Scraping the bowels of the Twitch API with Node.js

I had one deeply important question: who is on Twitch right now with nobody watching them? I find the performativity of Twitch pretty hilarious and wanted to see what people are doing sans audience.


So I built Streaming Into the Void. I’ve since seen some great stuff like 24/7 Screaming Cowboy and a woman metal detecting alone in a forest.

The process of making this was about as strange and painful as the streams it scrapes, so here’s the story of how I did it.

The Twitch API #

Twitch has a new API (Helix) which you can use to grab streams. It has a few attributes that matter here:

  1. Results come back in pages of up to 100 streams, and each response includes a pagination cursor you send back to get the next page.
  2. Requests are rate limited (for me, 120 calls per minute).
  3. You can filter by language and game, but not by viewer count, so finding empty streams means fetching everything and filtering it yourself.

Build a scheduler #

Based on those limitations, I wanted to build a scraper that would fire every minute and make the maximum number of calls possible. And because I wanted to show currently live streams with 0 viewers, this scraper would need to be running all the time. I set my server up to run a scraping function asynchronously on an interval:

const fetchStreams = require('./fetchStreams'); // my module wrapping the Helix calls

let batchNum = 0;   // how many batches have run so far
let cursor = null;  // pagination cursor carried between batches
let token;          // Twitch app access token (obtained elsewhere)

setInterval(
  async () => {
    batchNum++;
    console.log('Scheduled batch', batchNum, token);
    // fetch one batch of streams, picking up where the last batch left off
    let result = await fetchStreams.fetchBatch(token, cursor);
    cursor = result.cursor;
  },
  60000); // once a minute

I would need to save the pagination cursor value I received in the response and send it in the subsequent request, letting me get the next page of results with the next batch. In my fetch function, if there are 0 results returned, I reset the cursor to start at the top.

Build a scraper #

First, I kick off a loop of 120 calls (the rate limit). Each time, I construct a URI something like this:
https://api.twitch.tv/helix/streams?first=100&language=en&after=[cursor]&game_id=[game_id_1]&game_id=[game_id_2]

I discovered that in order to specify multiple games to filter by, I had to repeat the &game_id parameter many times.
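Building that query string is easy enough with a small helper. Here’s a rough sketch (buildUri and gameIds are names I’m using for illustration, and the IDs are placeholders):

const gameIds = ['509658', '26936']; // placeholder game IDs

const buildUri = (cursor) => {
  const base = 'https://api.twitch.tv/helix/streams?first=100&language=en';
  // the Helix API expects the game_id parameter repeated once per game
  const gameParams = gameIds.map(id => `game_id=${id}`).join('&');
  return cursor
    ? `${base}&after=${cursor}&${gameParams}`
    : `${base}&${gameParams}`;
};

With a URI builder like that in hand, the batch loop itself is straightforward: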

for(let i = 0; i < 120; i++) {
  let streams = await fetchStreams(cursor, token);

  if(streams.error) {
    console.error(`==> Bad response received (${streams.error.status}): ${streams.error.error}`);
    break;
  }

  cursor = (streams.data.length > 0
    ? streams.pagination.cursor
    : null);
}

I’m using the request-promise-native package to async/await a response. If I get rate limited, I just break the loop.
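For completeness, the per-request helper might look roughly like this. This is a sketch rather than my exact code; buildUri is the placeholder helper from above, and clientId stands in for a Twitch application client ID:

const rp = require('request-promise-native');
const clientId = process.env.TWITCH_CLIENT_ID; // your Twitch app's client ID

const fetchStreams = async (cursor, token) => {
  try {
    return await rp({
      uri: buildUri(cursor),
      headers: {
        'Client-ID': clientId,
        'Authorization': `Bearer ${token}`
      },
      json: true // parse the response body as JSON
    });
  } catch (err) {
    // request-promise-native rejects on non-2xx responses;
    // err.error holds the parsed error body, which the loop above checks
    return { error: err.error };
  }
};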

Within each API call, upon receiving the response, I filter the streams by viewer count:

const filterStreams = async (streams) => {
  let count = 0;
  for(const stream of streams) {
    // filter out streams
    if(stream && stream.viewer_count < 2 && stream.type === 'live') {
      // add stream to db
      addVoid(stream);
      count++;
    }
  }
  return count;
}

Saving the voids #

I was interested in trying PouchDB, a lightweight NoSQL database written in JavaScript and modeled on CouchDB. I found it particularly appealing because it has a built-in API, letting me skip building out models.

You can create a database with just a few lines:

const tmpPath = '/tmp'; // a writable directory on my host (more on that below)

const PouchDB = require('pouchdb-node').defaults({
  prefix: tmpPath + '/pouch/',
  auto_compaction: true
});
const Voids = new PouchDB('voids');

The defaults reflect a very tricky deployment to my Heroku-like host, Now.sh. PouchDB relies on writing to local files, and the host’s filesystem is read-only after deployment. I found that I was able to build the database in a writable /tmp/ directory, and I’ve since gotten it working with a Docker deployment.
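The “built-in API” part is what makes this viable: you can mount PouchDB behind a CouchDB-compatible HTTP endpoint and let the front end talk to it directly. Here’s a minimal sketch of one way to do that wiring, using the express-pouchdb package (the route and port are placeholders, and the real setup may differ):

const express = require('express');
const expressPouchDB = require('express-pouchdb');

const app = express();

// serve a CouchDB-compatible HTTP API for every database created
// with this PouchDB constructor, e.g. /db/voids
app.use('/db', expressPouchDB(PouchDB));

app.listen(3000);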

With an upsert plugin, I can insert/update stream data like so:

// pouchdb-upsert provides the upsert() method used below
PouchDB.plugin(require('pouchdb-upsert'));

const addVoid = async (stream) => {
  const newDoc = {
    _id: stream.id,
    'type': 'stream',
    'created_at': Date.now(),
    'user_id': stream.user_id,
    'title': stream.title,
    'started_at': stream.started_at,
    'language': stream.language,
    'thumbnail_url': stream.thumbnail_url,
    'viewers': stream.viewer_count
  };

  Voids.upsert(newDoc._id, (oldDoc) => {
    return newDoc;
  }).catch(e => console.log(e));
}

Querying the void #

It’s possible to connect to a PouchDB on the front-end (in my case, in a React component named StreamList) just like you do on the server. I wanted to be able to randomly pluck out 2 streams that I’ve fetched in the last few minutes, which have the highest likelihood of still being live.
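A minimal sketch of that connection (the URL is a placeholder for wherever the database’s HTTP API lives):

import PouchDB from 'pouchdb';
import PouchDBFind from 'pouchdb-find';

// createIndex/find come from the find plugin in some builds
PouchDB.plugin(PouchDBFind);

// same constructor as on the server, but pointed at the remote database
const Voids = new PouchDB('https://example.com/db/voids');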

To do that with PouchDB, I needed to create an index on the created_at timestamp:

Voids.createIndex({
  index: {
    fields: ['created_at']
  }
});

I generate a random timestamp within the last 2 minutes as a “starting point” for a query that A) selects 2 streams created after that timestamp and B) excludes streams the user has already seen:

let queryAtTime = Date.now() - Math.floor(this.randBetween(0, 2) * 60 * 1000);
let query = {
  selector: {
    _id: {
      $nin: this.state.seen
    },
    created_at: {
      $gte: queryAtTime
    }
  },
  sort: ['created_at'],
  limit: 2
};

// run the query against the created_at index
let result = await Voids.find(query);

Each time the query runs, I save the IDs of all the streams the user has seen to the component’s state.
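That bookkeeping is just a state update, roughly along these lines (result is the response from Voids.find above; the exact component code isn’t shown here):

// remember which streams have been shown so the next
// query can exclude them via $nin
this.setState(prevState => ({
  seen: [...prevState.seen, ...result.docs.map(doc => doc._id)]
}));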

Fin #

With that, the scraper and front-end largely work! Unfortunately, grabbing 2 streams on the front-end is slow right now (up to 800ms), and I’m trying to figure out why. I have a few suspicions:

  1. The query is slow because the index isn’t efficient
  2. The query is slow because filtering out already-seen IDs in the query is inefficient
  3. The PouchDB API running on my host is slow but the query is fine

There are also more dead streams than I’d like; my guess is that people go live, get discouraged because they have no viewers, and immediately go offline. One idea I’ve heard is to only show streams that have been live for more than 10 minutes. I also wish I could block the ads, which significantly detract from the experience, but there’s not much I can do there ;)
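If I try the 10-minute idea, the check itself is small. A sketch (not in the project yet) that could slot into filterStreams next to the viewer-count check:

// only keep streams that have already been live for 10+ minutes,
// based on the started_at timestamp Twitch returns
const TEN_MINUTES = 10 * 60 * 1000;
const liveLongEnough = (stream) =>
  Date.now() - new Date(stream.started_at).getTime() > TEN_MINUTES;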

Thanks for following along - hope you enjoyed learning a little bit about scraping Twitch.tv with Node.js!

 