Dump of igdb database or similar?

lunsjentilanette@sh.itjust.works · 9 days ago

Dump of igdb database or similar?

frongt@lemmy.zip · 9 days ago

Email them and ask. They can probably create you an account without twitch.

lunsjentilanette@sh.itjust.works · 9 days ago

I could try, but they are operated by Twitch so I somehow doubt they will do that?

frongt@lemmy.zip · 9 days ago

Oh, maybe not then.

Unless you have some university affiliation you could spin as “I’m a researcher and…”

Babalugats@feddit.uk · 9 days ago

You could use a free temp mail service (there are tons of them) or set up a dedicated private email account for these use cases.

Along with something like (some of these can be blocked on some websites)

Among others.

lunsjentilanette@sh.itjust.works · 9 days ago

I've tried that, but so far I've been unable to find a number that has not previously been used for Twitch, and after a few failed attempts Twitch stops sending out codes.

I've also experimented with scraping, but I am unable to form the request in a way that does not yield a 403 for this site and for howlongtobeat, for instance. I did get responses from rawg.io, but while they seem to have a lot of games, stuff like playtime didn't seem too reliable to me (Europa Universalis 4 being sub 10hr for example… :p), and I've also yet to find a way to generate all URLs to go through all their game pages (URLs contain the game name, not incremental IDs).

sus@programming.dev · edit-2 9 days ago

The free tier of IGDB has a rate limit of 4 requests per second and they will block you if you try to download the entire database, so this may be a dead end anyway.

Found a 4 megabyte dump that seems promising though it’s from 2018 so outdated.

https://gist.github.com/LeWawan/5858a9e7bef0f3dc4a79ac8bc1e3380c

Another one, a 68MB 3 year old “dataset” for AI training that seems related to IGDB.

https://www.kaggle.com/datasets/anudeepvanjavakam/igdb-api-data/data
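For anyone scripting against the API, the 4 requests per second limit mentioned above can be respected with a small client-side throttle. This is a minimal sketch, not anything IGDB provides: the RateLimiter class and its clock and sleep parameters are made-up names for illustration, and it simply spaces calls evenly.

```python
import time


class RateLimiter:
    """Allow at most `rate` calls per second by spacing calls evenly.

    Hypothetical helper for illustration; not part of any IGDB client.
    """

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay
```

Calling limiter.wait() right before each request, with limiter = RateLimiter(4), keeps a loop at or below four requests per second. The clock and sleep arguments only exist so the timing can be swapped out in tests.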

BakedCatboy@lemmy.ml · edit-2 9 days ago

I managed to request 363,020 entries with a page size of 500 at 1 request per second in about 20 minutes, so they don’t seem to be super strict. Though weirdly, after I load them into a dict to dedupe the IDs I get 362,969, so 51 entries share an ID with another one. I haven’t looked into what’s going on there.
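To look into the gap between the fetched count and the unique ID count, something like the following could list which IDs repeat. It is a rough sketch that assumes the dump is a list of dicts with an "id" key; the function names are made up for illustration. One plausible explanation, which is an assumption and not something verified against IGDB, is that entries shifted between pages while the offset-based scrape was running.

```python
from collections import Counter


def find_duplicate_ids(entries):
    """Return {id: occurrences} for every id that appears more than once."""
    counts = Counter(entry["id"] for entry in entries)
    return {entry_id: n for entry_id, n in counts.items() if n > 1}


def dedupe_by_id(entries):
    """Keep the last entry seen for each id, same as loading into a dict."""
    return list({entry["id"]: entry for entry in entries}.values())
```

Running find_duplicate_ids on the loaded JSON shows whether the repeats are a handful of IDs seen many times or many IDs seen twice, which helps tell a paging artifact apart from genuinely duplicated records.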

Babalugats@feddit.uk · 9 days ago

If you have Lycamobile or similar in your country… You can walk into a store and just pick up a SIM for free. In some countries they do have an activation fee, but I think that can be as little as €5 in Europe. Other providers offer similar deals, and you can then keep the number for 6 months or activate it again.

BakedCatboy@lemmy.ml · edit-2 9 days ago

Is playtime called something else in igdb? I’m not seeing that one in the docs.

Edit: I guess igdb doesn’t have playtime, but if all you need from igdb is the /games endpoint, here’s a scrape of 363020 entries (it seems there may be duplicate IDs as loading them into a dict yields 362969 entries)

Edit2: I just realized that igdb has /game_time_to_beats so I guess you’d want that as well

Updated dump of the main games endpoint and the time to beat endpoint (contains only 8604 entries in comparison):

https://mega.nz/file/xdJnEJLD#PlblwLr22Yfea4GUERBLwTsuunbwE3pGsq41OBqIaDg
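Since playtime lives in a separate endpoint, the two dumps have to be joined to get playtime per game. Here is a hedged sketch of that join. It assumes each time-to-beat entry carries a game_id pointing at the game and a normally value in seconds; those field names are assumptions, so check them against the actual JSON before relying on this. The function name is made up for illustration.

```python
def attach_time_to_beat(games, time_entries):
    """Join games with time-to-beat entries and report hours per game.

    Assumes (unverified) that each time entry has "game_id" and a
    "normally" duration in seconds.
    """
    by_game = {t["game_id"]: t for t in time_entries if "game_id" in t}
    merged = []
    for game in games:
        ttb = by_game.get(game["id"])
        hours = None
        if ttb and ttb.get("normally") is not None:
            hours = round(ttb["normally"] / 3600, 1)
        merged.append(
            {"id": game["id"], "name": game.get("name"), "normal_hours": hours}
        )
    return merged
```

Games with no matching entry get normal_hours set to None, which will be most of them given that the time-to-beat dump has only 8604 entries.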

lunsjentilanette@sh.itjust.works · 8 days ago

I love you<3

BakedCatboy@lemmy.ml · edit-2 8 days ago

You’re welcome! It was a fun hyperfixation project. I ended up making the script so easy to use I decided to just scrape every other endpoint too, so if anyone wants it, here’s a full dump of every endpoint, it’s only like 4x bigger:

https://mega.nz/file/YF4F3bCS#pkS8Ki9QuucMGJF65YwGUE-NQZ78QEWs73fmF71qa18

And if anyone wants to do their own scraping to get more up to date data later, just pip install:

python-dotenv==1.2.2
Requests==2.34.2
tqdm==4.67.1

Put API keys in .env or export env vars:

CLIENT_ID=<client_id>
# Provide to fetch new token
CLIENT_SECRET=<client_secret>
# Optional, provide to reuse existing access token, secret will not be used
ACCESS_TOKEN=<access_token>

And just run python dump.py games or any other endpoint in the api docs like release_dates etc. It outputs the json and a simple log to an output folder wherever you ran it. No error handling or checkpointing so if it fails partway through you don’t get anything, but I didn’t have a single error the whole time.

usage: dump.py [-h] api_route

IGDB Dump Script

positional arguments:
  api_route   The API route to scrape, eg: games or game_time_to_beats

options:
  -h, --help  show this help message and exit

dump.py:

import argparse
import json
import logging
import os
import pathlib
import time
from dotenv import dotenv_values
import requests
from tqdm import tqdm

API_PAGE_SIZE = 500
OUT_DIR = "output"

config = {
    **dotenv_values(".env"),
    **os.environ,
}

# Set up flags / args
parser = argparse.ArgumentParser(
    prog="dump.py", description="IGDB Dump Script"
)
parser.add_argument(
    "api_route",
    help="The API route to scrape, eg: games or game_time_to_beats",
)
args = parser.parse_args()

# Create out dir
# mkdir is an instance method, so wrap the directory name in a Path first
pathlib.Path(OUT_DIR).mkdir(parents=False, exist_ok=True)

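# Set up logging to the route's file
# Console output goes through tqdm.write so log lines don't break the progress bar;
# tqdm.write appends its own newline, hence the empty terminator
tqdmHandler = logging.StreamHandler(tqdm)
tqdmHandler.terminator = ""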
logging.basicConfig(
    level=logging.INFO,
    format="%(message)s",
    handlers=[
        logging.FileHandler(f"{OUT_DIR}/{args.api_route}.log"),
        tqdmHandler
    ],
)

# Check for existing json to prevent overwriting existing dumps
outFile = f"{OUT_DIR}/{args.api_route}.json"
if pathlib.Path(outFile).exists():
    print(f"Existing json found {outFile}, please move or remove it before proceeding")
    exit(1)

# Use .get() so a missing key reaches the error branch instead of raising KeyError
if config.get('CLIENT_ID') and config.get('ACCESS_TOKEN'):
    logging.info("Using CLIENT_ID and existing ACCESS_TOKEN")
elif config.get('CLIENT_ID') and config.get('CLIENT_SECRET'):
    logging.info("Fetching new access token...")
    response = requests.post(
        url="https://id.twitch.tv/oauth2/token",
        params={
            "client_id": config['CLIENT_ID'],
            "client_secret": config['CLIENT_SECRET'],
            "grant_type": "client_credentials"
        },
        timeout=30
    )
    # Stop early with a clear HTTP error if the credentials are rejected
    response.raise_for_status()
    config['ACCESS_TOKEN'] = response.json()['access_token']
else:
    logging.error("Missing CLIENT_ID and CLIENT_SECRET or ACCESS_TOKEN")
    exit(1)

# Re-check access token in case fetch failed
if config['CLIENT_ID'] and config['ACCESS_TOKEN']:
    items = []
    offset = 0
    logging.info(f"Fetching batches of {API_PAGE_SIZE} on endpoint {args.api_route}")
    with tqdm() as pbar:
        while True:
            response = requests.post(
                url=f"https://api.igdb.com/v4/{args.api_route}",
                headers={
                    "Client-ID": config['CLIENT_ID'],
                    "Authorization": f"Bearer {config['ACCESS_TOKEN']}"
                },
                data=f"fields *; limit {API_PAGE_SIZE}; offset {offset};",
                timeout=30
            )
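            # Fail loudly on HTTP errors instead of treating an error body as a page
            response.raise_for_status()
            newItems = response.json()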
            fetchCount = len(newItems)
            pbar.update(fetchCount)
            if fetchCount != API_PAGE_SIZE:
                logging.info(f"WARN: Requested {API_PAGE_SIZE}, got {fetchCount}")
            offset += API_PAGE_SIZE
            items.extend(newItems)
            if fetchCount < API_PAGE_SIZE:
                logging.info("Received partial page, ending")
                break
            time.sleep(1)

    logging.info(f"Total fetched: {len(items)}")
    with open(outFile, "w", encoding="utf-8") as file:
        logging.info("Writing to json...")
        json.dump(items, file, ensure_ascii=False, indent=2)

    # Print some stats
    logging.info(f"\nChecking json output: {args.api_route}.json")

    entries = []
    with open(outFile, "r", encoding="utf-8") as file:
        entries = json.load(file)

    logging.info(f"{len(entries)} entries in json")

    entryDict = {}
    for entry in entries:
        entryDict.update({entry['id']: entry})

    logging.info(f"{len(entryDict)} unique IDs in json")
else:
    logging.error("Client ID or Access Token not available")
    exit(1)
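As noted above, the script keeps everything in memory and has no checkpointing, so a failure partway through loses the whole run. One possible way around that is to append each page to a JSON Lines file and remember the next offset in a small sidecar file. This is only a sketch under assumptions: dump_with_checkpoint and the fetch_page(offset, limit) callable are hypothetical names, and fetch_page stands in for the requests.post call in dump.py.

```python
import json
import pathlib


def dump_with_checkpoint(fetch_page, out_path, page_size=500):
    """Append each page to a JSON Lines file and resume after a crash.

    fetch_page(offset, limit) is a hypothetical callable that returns a
    list of dicts. Returns the number of items written during this run.
    """
    out = pathlib.Path(out_path)
    state = out.with_suffix(".offset")  # sidecar file holding the next offset
    offset = int(state.read_text()) if state.exists() else 0
    written = 0
    while True:
        page = fetch_page(offset, page_size)
        with out.open("a", encoding="utf-8") as fh:
            for item in page:
                fh.write(json.dumps(item, ensure_ascii=False) + "\n")
        written += len(page)
        offset += page_size
        # Save progress only after the page is safely on disk
        state.write_text(str(offset))
        if len(page) < page_size:
            break  # partial page means the endpoint is exhausted
    return written
```

If the process dies between writing a page and saving the offset, that page is written again on the next run, so deduping by ID afterwards is still a good idea. The trade-off compared to dump.py is that the output is one JSON object per line and not a single JSON array.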