I am interested in accessing the data of igdb.com for a hobby project. They have an API that is “free”, i.e. it requires a Twitch account with 2FA enabled, which means I would need to hand over my phone number to Twitch. That is a hard no from my side.
Anyone know of a way to get a dump of their database, or a similar source where I could get API access without revealing personal information? For my project I am after only title, release year, some sort of rating and some sort of playtime.
Email them and ask. They can probably create you an account without twitch.
I could try, but they are operated by Twitch, so I somehow doubt they will do that.
Oh, maybe not then.
Unless you have some university affiliation you could spin as “I’m a researcher and…”
You could use a free temp mail service (there are tons of them) or set up a dedicated private email account for these use cases.
Along with something like (some of these can be blocked on some websites)
Among others.
I've tried that, but have so far been unable to find a number that has not previously been used for Twitch, and after a few failed attempts Twitch stops sending out codes.
I've also experimented with scraping, but I am unable to form the request in a way that does not yield a 403 for this site or for howlongtobeat, for instance. I did get responses from rawg.io, but while they seem to have a lot of games, stuff like playtime didn't seem too reliable to me (Europa Universalis 4 being sub 10hr for example… :p), and I've also yet to find a way to generate all URLs to go through all their game pages (URLs contain the game name, not incremental IDs).
The free tier of IGDB has a rate limit of 4 requests per second and they will block you if you try to download the entire database, so this may be a dead end anyway.
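For anyone who wants to stay under a limit like that, here is a minimal sketch of a client-side throttle. The `RateLimiter` class and its parameters are made up for illustration and are not part of any IGDB client. It only spaces out calls and does not handle a 429 or a block from the server.

```python
import time


class RateLimiter:
    """Space out calls so they start at least 1/max_per_second apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


# Usage: call limiter.wait() right before every request.
limiter = RateLimiter(max_per_second=4)
```

With `max_per_second=4` the second and later calls to `wait()` pause up to 0.25 seconds each, so a loop of requests never exceeds four per second.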
Found a 4 megabyte dump that seems promising though it’s from 2018 so outdated.
https://gist.github.com/LeWawan/5858a9e7bef0f3dc4a79ac8bc1e3380c
Another one, a 68MB 3 year old “dataset” for AI training that seems related to IGDB.
https://www.kaggle.com/datasets/anudeepvanjavakam/igdb-api-data/data
I managed to request 363,020 entries at 500 page size 1 request per second in about 20 minutes, so they don’t seem to be super strict. Though weirdly after I load them into a dict to dedupe the IDs I get 362,969, haven’t looked into what’s going on there.
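A quick way to look into that mismatch is to count how often each ID shows up. The sketch below assumes the scraped entries are a list of dicts that each have an `id` key; the helper names are made up. One possible cause (a guess, not verified) is that paging by offset without a fixed sort order lets rows shift between pages while the scrape is running.

```python
from collections import Counter


def duplicate_ids(entries):
    """Return {id: count} for every id that appears more than once."""
    counts = Counter(entry["id"] for entry in entries)
    return {id_: n for id_, n in counts.items() if n > 1}


def dedupe_by_id(entries):
    """Keep the last entry seen for each id, same as loading into a dict."""
    return list({entry["id"]: entry for entry in entries}.values())


# Tiny made-up sample, just to show the shape of the input
sample = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "a"},
]
dupes = duplicate_ids(sample)
unique = dedupe_by_id(sample)
```

If the duplicated entries turn out to be identical copies, deduping by ID loses nothing; if they differ, it is worth checking which copy is newer before dropping one.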
If you have Lycamobile or similar in your country… You can walk into a store and just pick up a SIM for free. In some countries they do have an activation fee, but I think that can be as little as €5 in Europe. Other providers offer something similar; you can then keep the number for 6 months or activate it again.
Is playtime called something else in igdb? I’m not seeing that one in the docs.
Edit: I guess igdb doesn’t have playtime, but if all you need from igdb is the /games endpoint, here’s a scrape of 363020 entries (it seems there may be duplicate IDs as loading them into a dict yields 362969 entries)
Edit2: I just realized that igdb has /game_time_to_beats so I guess you’d want that as well
Updated dump of the main games endpoint and the time to beat endpoint (contains only 8604 entries in comparison):
https://mega.nz/file/xdJnEJLD#PlblwLr22Yfea4GUERBLwTsuunbwE3pGsq41OBqIaDg
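To get from the two dumps to just title, release year, rating and playtime, something like the sketch below might work. The field names are assumptions to check against the API docs and the actual JSON: `name`, `first_release_date` (Unix seconds) and `total_rating` on games, and `game_id` and `normally` (seconds) on the time to beat entries. `build_rows` is a made-up helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def build_rows(games, time_to_beats):
    """Join games with time-to-beat entries and keep four columns."""
    # Index time-to-beat entries by the game they belong to (assumed key)
    ttb_by_game = {t["game_id"]: t for t in time_to_beats if "game_id" in t}
    rows = []
    for game in games:
        released = game.get("first_release_date")  # assumed Unix seconds
        year = (EPOCH + timedelta(seconds=released)).year if released is not None else None
        seconds = ttb_by_game.get(game["id"], {}).get("normally")
        rows.append({
            "title": game.get("name"),
            "year": year,
            "rating": game.get("total_rating"),
            "hours": round(seconds / 3600, 1) if seconds else None,
        })
    return rows


# Made-up entries, only to show the expected input shape
games = [
    {"id": 1, "name": "Example Game", "first_release_date": 1376352000, "total_rating": 88.5},
    {"id": 2, "name": "No Data Game"},
]
time_to_beats = [{"game_id": 1, "normally": 36000}]
rows = build_rows(games, time_to_beats)
```

Games without a time to beat entry simply get `None` for hours, which will be most of them given how few entries that endpoint has compared to the games endpoint.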
I love you<3
You’re welcome! It was a fun hyperfixation project. I ended up making the script so easy to use I decided to just scrape every other endpoint too, so if anyone wants it, here’s a full dump of every endpoint, it’s only like 4x bigger:
https://mega.nz/file/YF4F3bCS#pkS8Ki9QuucMGJF65YwGUE-NQZ78QEWs73fmF71qa18
And if anyone wants to do their own scraping to get more up to date data later, just pip install:
    python-dotenv==1.2.2
    Requests==2.34.2
    tqdm==4.67.1

Put API keys in .env or export env vars:

    CLIENT_ID=<client_id>
    # Provide to fetch new token
    CLIENT_SECRET=<client_secret>
    # Optional, provide to reuse existing access token, secret will not be used
    ACCESS_TOKEN=<access_token>

And just run

    python dump.py games

or any other endpoint in the api docs like release_dates etc. It outputs the json and a simple log to an output folder wherever you ran it. No error handling or checkpointing so if it fails partway through you don’t get anything, but I didn’t have a single error the whole time.

    usage: dump.py [-h] api_route

    IGDB Dump Script

    positional arguments:
      api_route   The API route to scrape, eg: games or game_time_to_beats

    options:
      -h, --help  show this help message and exit

dump.py:

    import argparse
    import json
    import logging
    import os
    import pathlib
    import sys
    import time

    from dotenv import dotenv_values
    import requests
    from tqdm import tqdm

    API_PAGE_SIZE = 500
    OUT_DIR = "output"

    # Values from .env, overridden by real environment variables
    config = {
        **dotenv_values(".env"),
        **os.environ,
    }

    # Set up flags / args
    parser = argparse.ArgumentParser(
        prog="dump.py",
        description="IGDB Dump Script"
    )
    parser.add_argument(
        "api_route",
        help="The API route to scrape, eg: games or game_time_to_beats",
    )
    args = parser.parse_args()

    # Create out dir
    pathlib.Path(OUT_DIR).mkdir(parents=False, exist_ok=True)

    # Set up logging to the route's file
    # Console output goes through tqdm.write so it doesn't break the progress bar
    tqdmHandler = logging.StreamHandler(tqdm)
    tqdmHandler.terminator = ""
    logging.basicConfig(
        level=logging.INFO,
        format="%(message)s",
        handlers=[
            logging.FileHandler(f"{OUT_DIR}/{args.api_route}.log"),
            tqdmHandler
        ],
    )

    # Check for existing json to prevent overwriting existing dumps
    outFile = f"{OUT_DIR}/{args.api_route}.json"
    if pathlib.Path(outFile).exists():
        print(f"Existing json found {outFile}, please move or remove it before proceeding")
        sys.exit(1)

    # Use .get() so a missing key counts as "not set" instead of raising KeyError
    if config.get('CLIENT_ID') and config.get('ACCESS_TOKEN'):
        logging.info("Using CLIENT_ID and existing ACCESS_TOKEN")
    elif config.get('CLIENT_ID') and config.get('CLIENT_SECRET') and not config.get('ACCESS_TOKEN'):
        logging.info("Fetching new access token...")
        response = requests.post(
            url="https://id.twitch.tv/oauth2/token",
            params={
                "client_id": config['CLIENT_ID'],
                "client_secret": config['CLIENT_SECRET'],
                "grant_type": "client_credentials"
            },
            timeout=30
        )
        config['ACCESS_TOKEN'] = response.json().get('access_token')
    else:
        logging.info("Missing CLIENT_ID and CLIENT_SECRET or ACCESS_TOKEN")
        sys.exit(1)

    # Re-check access token in case fetch failed
    if config.get('CLIENT_ID') and config.get('ACCESS_TOKEN'):
        items = []
        offset = 0
        logging.info(f"Fetching batches of {API_PAGE_SIZE} on endpoint {args.api_route}")
        with tqdm() as pbar:
            while True:
                response = requests.post(
                    url=f"https://api.igdb.com/v4/{args.api_route}",
                    headers={
                        "Client-ID": config['CLIENT_ID'],
                        "Authorization": f"Bearer {config['ACCESS_TOKEN']}"
                    },
                    data=f"fields *; limit {API_PAGE_SIZE}; offset {offset};",
                    timeout=30
                )
                newItems = response.json()
                fetchCount = len(newItems)
                pbar.update(fetchCount)
                if fetchCount != API_PAGE_SIZE:
                    logging.info(f"WARN: Requested {API_PAGE_SIZE}, got {fetchCount}")
                offset += API_PAGE_SIZE
                items.extend(newItems)
                # A short page means we reached the end of the endpoint
                if fetchCount < API_PAGE_SIZE:
                    logging.info("Received partial page, ending")
                    break
                time.sleep(1)
        logging.info(f"Total fetched: {len(items)}")
        with open(outFile, "w", encoding="utf-8") as file:
            logging.info("Writing to json...")
            json.dump(items, file, ensure_ascii=False, indent=2)

        # Print some stats
        logging.info(f"\nChecking json output: {args.api_route}.json")
        entries = []
        with open(outFile, "r", encoding="utf-8") as file:
            entries = json.load(file)
        logging.info(f"{len(entries)} entries in json")
        entryDict = {}
        for entry in entries:
            entryDict.update({entry['id']: entry})
        logging.info(f"{len(entryDict)} unique IDs in json")
    else:
        logging.error("Client ID or Access Token not available")
        sys.exit(1)
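Since the script has no checkpointing, here is a rough sketch of one way to add it: append every page to a JSON Lines file as it arrives and, on a rerun, resume from the number of lines already on disk. `dump_with_checkpoint` and `fetch_page` are hypothetical names. `fetch_page(offset, limit)` stands in for a wrapper around the `requests.post` call in the script and is assumed to return a list of dicts.

```python
import json
import os
import time


def dump_with_checkpoint(fetch_page, out_path, page_size=500, delay=1.0):
    """Append each page to a JSON Lines file so a rerun can resume.

    Returns the total number of items on disk when the dump finishes.
    """
    # Resume: every line already on disk is one fetched item
    offset = 0
    if os.path.exists(out_path):
        with open(out_path, "r", encoding="utf-8") as f:
            offset = sum(1 for _ in f)

    with open(out_path, "a", encoding="utf-8") as f:
        while True:
            page = fetch_page(offset, page_size)
            for item in page:
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
            f.flush()  # make sure the page is on disk before moving on
            offset += len(page)
            if len(page) < page_size:
                break  # short page, reached the end
            time.sleep(delay)  # same 1 request per second pacing as the script
    return offset
```

This does not protect against a crash in the middle of writing a line, and the output is one JSON object per line instead of a single JSON array, so it would need converting if the original format matters.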

