How to Scrape Reddit with Google Scripts

Web scraping is the practice of extracting data from websites, usually automatically over HTTP. Reddit offers a fairly substantial API that anyone can use to pull data from subreddits: posts, user comments, image thumbnails, and the other attributes attached to a Reddit post.

However, the official API only returns recent posts and caps results at 1,000 per listing; anything older is out of reach. Other scraping tools exist, but few are as thorough and straightforward as Google Apps Script. Most alternatives involve JavaScript (Node.js) or Python and are considerably harder to set up and run.

Hence, we use a Google Script that saves all the posts and comments of a subreddit to a Google Sheet in your Google Drive. And since it queries pushshift.io (a big-data storage and analytics project for Reddit data built by Jason Baumgartner) rather than the official Reddit API, there’s no cap: it will download everything that was ever posted to a subreddit.
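To get a feel for what Pushshift returns before running the full script, here is a minimal sketch (the previewPushshift function is our own illustration, not part of the sheet’s script) that logs the titles of the five newest posts of a subreddit:

// Hypothetical helper, for illustration only: fetches the five newest
// posts of a subreddit from Pushshift and logs their titles.
const previewPushshift = () => {
  const url = 'https://api.pushshift.io/reddit/search/submission?subreddit=technology&size=5';
  const response = UrlFetchApp.fetch(url);
  const { data } = JSON.parse(response.getContentText());
  data.forEach(post => Logger.log(post.title));
};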

  1. To start, open the Google Sheet and make a copy of it in your Google Drive.
  2. Go to Tools -> Script editor to open the Google Script that fetches all the data from the chosen subreddit. Go to line 55 and change technology to the name of the subreddit you wish to scrape.
  3. While you’re in the script editor, choose Run -> scrapeReddit.

After you authorize the script, all the Reddit posts will be added to the Google Sheet in your Drive within a minute or two.
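Running a function from the script editor passes it no arguments, so scrapeReddit uses its default subreddit. If you’d rather not edit the code each time, a hypothetical one-line wrapper (not part of the original sheet) can target another subreddit:

// Hypothetical wrapper, for illustration: scrapes r/askscience
// without editing the default parameter of scrapeReddit.
const scrapeAskScience = () => scrapeReddit('askscience');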

Before requesting more data, the script checks Pushshift’s /meta endpoint to make sure we haven’t hit the rate limit:

// Returns true when Pushshift reports fewer than 1 allowed
// requests per minute.
const isRateLimited = () => {
  const response = UrlFetchApp.fetch('https://api.pushshift.io/meta');
  // Read the response body explicitly before parsing it as JSON.
  const { server_ratelimit_per_minute: limit } = JSON.parse(response.getContentText());
  return limit < 1;
};

Next, we specify the subreddit name and run the script to fetch posts in batches of 1000. After each batch, the data is written to the Google Sheet, and the timestamp of the oldest post becomes the before cursor for the next request.

// Builds the Pushshift search URL for a subreddit, requesting only
// the fields we need; `before` pages backwards through older posts.
const getAPIEndpoint_ = (subreddit, before = '') => {
  const fields = ['title', 'created_utc', 'url', 'thumbnail', 'full_link'];
  const size = 1000;
  const base = 'https://api.pushshift.io/reddit/search/submission';
  const params = { subreddit, size, fields: fields.join(',') };
  if (before) params.before = before;
  const query = Object.keys(params)
    .map(key => `${key}=${params[key]}`)
    .join('&');
  return `${base}?${query}`;
};
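For example, getAPIEndpoint_('technology') evaluates to https://api.pushshift.io/reddit/search/submission?subreddit=technology&size=1000&fields=title,created_utc,url,thumbnail,full_link.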

// Fetches the subreddit's posts in batches of 1000, walking backwards
// in time until Pushshift returns an empty batch or rate-limits us.
const scrapeReddit = (subreddit = 'technology') => {
  let before = '';
  do {
    const apiUrl = getAPIEndpoint_(subreddit, before);
    const response = UrlFetchApp.fetch(apiUrl);
    const { data } = JSON.parse(response.getContentText());
    const { length } = data;
    // The oldest post's timestamp becomes the cursor for the next batch.
    before = length > 0 ? String(data[length - 1].created_utc) : '';
    if (length > 0) {
      writeDataToSheets_(data);
    }
  } while (before !== '' && !isRateLimited());
};

The output of the Pushshift service is very verbose, so we narrow it down to the relevant attributes with the fields parameter.
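Each record in the response’s data array then carries only the requested keys. An illustrative record (the values below are made up):

// One trimmed record, with made-up example values:
// {
//   "title": "Example post title",
//   "created_utc": 1577836800,
//   "url": "https://i.redd.it/example.jpg",
//   "thumbnail": "https://b.thumbs.redditmedia.com/example.jpg",
//   "full_link": "https://www.reddit.com/r/technology/comments/abc123/example/"
// }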

If a post contains an image, we wrap its URL in the Google Sheets IMAGE function so you can preview the image inside the sheet; post links are similarly wrapped in HYPERLINK formulas.

// Wraps an image URL in the Sheets IMAGE() formula for inline previews.
const getThumbnailLink_ = url => {
  if (!/^http/.test(url)) return '';
  return `=IMAGE("${url}")`;
};

// Wraps a URL and its label text in the Sheets HYPERLINK() formula.
const getHyperlink_ = (url, text) => {
  if (!/^http/.test(url)) return '';
  return `=HYPERLINK("${url}", "${text}")`;
};
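One piece the listing doesn’t show is writeDataToSheets_, which scrapeReddit calls for every batch. Here is a minimal sketch, assuming a simple four-column layout (date, linked title, URL, thumbnail); the template’s actual sheet may use a different schema:

// A minimal sketch of writeDataToSheets_ (the column layout is an
// assumption, not the template's exact schema): appends one row per
// post to the active sheet, reusing the two formula helpers above.
const writeDataToSheets_ = data => {
  const rows = data.map(({ title, created_utc, url, thumbnail, full_link }) => [
    new Date(created_utc * 1000),       // post date from the UTC epoch
    getHyperlink_(full_link, title),    // clickable post title
    url,                                // target URL of the post
    getThumbnailLink_(thumbnail),       // inline image preview
  ]);
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  sheet
    .getRange(sheet.getLastRow() + 1, 1, rows.length, rows[0].length)
    .setValues(rows);
};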

And that’s all. You’re now equipped to scrape every post you need from any subreddit. Cheers!