I am interested in accessing the data of igdb.com for a hobby project. They have an API that is “free”, i.e. it requires a Twitch account with 2FA enabled, which means I would need to hand over my phone number to Twitch. That is a hard no from my side.

Does anyone know of a way to get a dump of their database, or a similar source where I could get API access without revealing personal information? For my project I am only after title, release year, some sort of rating and some sort of playtime.

  • frongt@lemmy.zip
    9 days ago

    Email them and ask. They can probably create an account for you without Twitch.

    • lunsjentilanette@sh.itjust.worksOP
      9 days ago

      I’ve tried that, but so far I have been unable to find a number that has not previously been used for Twitch, and after a few failed attempts Twitch stops sending out codes.

      I’ve also experimented with scraping, but I am unable to form the request in a way that does not yield a 403, both for this site and for howlongtobeat, for instance. I did get responses from rawg.io, but while they seem to have a lot of games, stuff like playtime didn’t seem too reliable to me (Europa Universalis 4 being sub 10 hr, for example… :p), and I’ve also yet to find a way to generate all the URLs needed to go through all their game pages (URLs contain the game name, not incremental IDs).

      • sus@programming.dev
        9 days ago

        The free tier of IGDB has a rate limit of 4 requests per second and they will block you if you try to download the entire database, so this may be a dead end anyway.
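
To stay under a limit like that, a client can space out its own requests. A minimal sketch, assuming the 4 requests per second figure quoted above; `RateLimiter` is a hypothetical helper written for illustration, not part of any IGDB client library:

```python
import time


class RateLimiter:
    """Spaces calls so that at most max_per_second start per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

Calling `limiter.wait()` right before each request keeps the pace at or below the configured rate; a plain `time.sleep(1)` between requests is simpler and even more conservative.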

        Found a 4 megabyte dump that seems promising though it’s from 2018 so outdated.

        https://gist.github.com/LeWawan/5858a9e7bef0f3dc4a79ac8bc1e3380c

        Another one, a 68MB 3 year old “dataset” for AI training that seems related to IGDB.

        https://www.kaggle.com/datasets/anudeepvanjavakam/igdb-api-data/data

        • BakedCatboy@lemmy.ml
          9 days ago

          I managed to request 363,020 entries at a page size of 500 and 1 request per second in about 20 minutes, so they don’t seem to be super strict. Weirdly, though, after I load them into a dict to dedupe the IDs I get 362,969; I haven’t looked into what’s going on there.
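
For reference, 363,020 entries at a page size of 500 works out to 727 requests, and the gap between the two counts is 363,020 − 362,969 = 51 entries. One possible explanation, not verified here, is that offset-based paging over a database that changes during the scrape can return the same row on two different pages. A minimal sketch for listing which IDs repeat, assuming each entry is a dict with an `id` key as in the dump; `find_duplicate_ids` is a hypothetical helper and the sample data is made up:

```python
from collections import Counter


def find_duplicate_ids(entries):
    """Return {id: occurrences} for every id that appears more than once."""
    counts = Counter(entry["id"] for entry in entries)
    return {entry_id: n for entry_id, n in counts.items() if n > 1}


# Tiny made-up sample standing in for the real dump.
sample = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}]
duplicates = find_duplicate_ids(sample)
print(duplicates)
```

Comparing the repeated entries field by field would show whether they are exact copies or different revisions of the same game.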

      • Babalugats@feddit.uk
        9 days ago

        If you have Lycamobile or similar in your country, you can walk into a store and just pick one up for free. In some countries they do have an activation fee, but I think that can be as little as €5 in Europe. Other providers offer similar deals; you then keep the number for 6 months or activate it again.

  • BakedCatboy@lemmy.ml
    9 days ago

    Is playtime called something else in igdb? I’m not seeing that one in the docs.

    Edit: I guess igdb doesn’t have playtime, but if all you need from igdb is the /games endpoint, here’s a scrape of 363020 entries (it seems there may be duplicate IDs as loading them into a dict yields 362969 entries)

    Edit2: I just realized that igdb has /game_time_to_beats so I guess you’d want that as well

    Updated dump of the main games endpoint and the time to beat endpoint (contains only 8604 entries in comparison):

    https://mega.nz/file/xdJnEJLD#PlblwLr22Yfea4GUERBLwTsuunbwE3pGsq41OBqIaDg
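
To get from the two dumps to the fields asked for in the original post (title, release year, rating, playtime), the entries can be joined on the game ID. A rough sketch, with loud assumptions: the field names `name`, `first_release_date` (Unix timestamp), `total_rating`, `game_id` and `normally` (seconds) are taken from memory of the IGDB docs and should be checked against the actual JSON; `build_rows` is a hypothetical helper and the sample entries are made up:

```python
from datetime import datetime, timezone


def build_rows(games, time_to_beats):
    """Combine game entries with their time-to-beat entry, if one exists."""
    # Index time-to-beat entries by the game they belong to (assumed key: game_id).
    ttb_by_game = {t["game_id"]: t for t in time_to_beats}
    rows = []
    for game in games:
        released = game.get("first_release_date")  # assumed Unix timestamp
        year = (
            datetime.fromtimestamp(released, tz=timezone.utc).year
            if released is not None
            else None
        )
        seconds = ttb_by_game.get(game["id"], {}).get("normally")  # assumed seconds
        rows.append({
            "title": game.get("name"),
            "year": year,
            "rating": game.get("total_rating"),
            "hours": round(seconds / 3600, 1) if seconds is not None else None,
        })
    return rows


# Made-up entries for illustration only.
games = [
    {"id": 1, "name": "Example Game", "first_release_date": 1376352000, "total_rating": 87.5},
    {"id": 2, "name": "No Data Game"},
]
time_to_beats = [{"game_id": 1, "normally": 360000}]
rows = build_rows(games, time_to_beats)
```

Since the time-to-beat dump has far fewer entries than the games dump, most rows will end up with no playtime, which the sketch represents as `None`.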

      • BakedCatboy@lemmy.ml
        8 days ago

        You’re welcome! It was a fun hyperfixation project. I ended up making the script so easy to use I decided to just scrape every other endpoint too, so if anyone wants it, here’s a full dump of every endpoint, it’s only like 4x bigger:

        https://mega.nz/file/YF4F3bCS#pkS8Ki9QuucMGJF65YwGUE-NQZ78QEWs73fmF71qa18

        And if anyone wants to do their own scraping to get more up to date data later, just pip install:

        python-dotenv==1.2.2
        Requests==2.34.2
        tqdm==4.67.1
        

        Put API keys in .env or export env vars:

        CLIENT_ID=<client_id>
        # Provide to fetch new token
        CLIENT_SECRET=<client_secret>
        # Optional, provide to reuse existing access token, secret will not be used
        ACCESS_TOKEN=<access_token>
        

        And just run python dump.py games or any other endpoint in the api docs like release_dates etc. It outputs the json and a simple log to an output folder wherever you ran it. No error handling or checkpointing so if it fails partway through you don’t get anything, but I didn’t have a single error the whole time.

        usage: dump.py [-h] api_route
        
        IGDB Dump Script
        
        positional arguments:
          api_route   The API route to scrape, eg: games or game_time_to_beats
        
        options:
          -h, --help  show this help message and exit
        

        dump.py:

        import argparse
        import json
        import logging
        import os
        import pathlib
        import time
        from dotenv import dotenv_values
        import requests
        from tqdm import tqdm
        
        API_PAGE_SIZE = 500
        OUT_DIR = "output"
        
        config = {
            **dotenv_values(".env"),
            **os.environ,
        }
        
        # Set up flags / args
        parser = argparse.ArgumentParser(
            prog="dump.py", description="IGDB Dump Script"
        )
        parser.add_argument(
            "api_route",
            help="The API route to scrape, eg: games or game_time_to_beats",
        )
        args = parser.parse_args()
        
        # Create out dir
        pathlib.Path(OUT_DIR).mkdir(parents=False, exist_ok=True)
        
        # Set up logging to the route's file
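        # tqdm.write() prints above the progress bar and adds its own newline,
        # so the handler's terminator is cleared to avoid blank lines
        tqdmHandler = logging.StreamHandler(tqdm)
        tqdmHandler.terminator = ""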
        logging.basicConfig(
            level=logging.INFO,
            format="%(message)s",
            handlers=[
                logging.FileHandler(f"{OUT_DIR}/{args.api_route}.log"),
                tqdmHandler
            ],
        )
        
        # Check for existing json to prevent overwriting existing dumps
        outFile = f"{OUT_DIR}/{args.api_route}.json"
        if pathlib.Path(outFile).exists():
            print(f"Existing json found {outFile}, please move or remove it before proceeding")
            exit(1)
        
        # Use .get() so a missing key counts as "not set" instead of raising KeyError
        if config.get('CLIENT_ID') and config.get('ACCESS_TOKEN'):
            logging.info("Using CLIENT_ID and existing ACCESS_TOKEN")
        elif config.get('CLIENT_ID') and config.get('CLIENT_SECRET'):
            logging.info("Fetching new access token...")
            response = requests.post(
                url="https://id.twitch.tv/oauth2/token",
                params={
                    "client_id": config['CLIENT_ID'],
                    "client_secret": config['CLIENT_SECRET'],
                    "grant_type": "client_credentials"
                },
                timeout=30
            )
            # .get() leaves the token unset on a failed fetch so the re-check below catches it
            config['ACCESS_TOKEN'] = response.json().get('access_token')
        else:
            logging.info("Missing CLIENT_ID and CLIENT_SECRET or ACCESS_TOKEN")
            exit(1)
        
        # Re-check access token in case fetch failed
        if config['CLIENT_ID'] and config['ACCESS_TOKEN']:
            items = []
            offset = 0
            logging.info(f"Fetching batches of {API_PAGE_SIZE} on endpoint {args.api_route}")
            with tqdm() as pbar:
                while True:
                    response = requests.post(
                        url=f"https://api.igdb.com/v4/{args.api_route}",
                        headers={
                            "Client-ID": config['CLIENT_ID'],
                            "Authorization": f"Bearer {config['ACCESS_TOKEN']}"
                        },
                        data=f"fields *; limit {API_PAGE_SIZE}; offset {offset};",
                        timeout=30
                    )
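                    # Fail loudly on HTTP errors instead of treating an error body as a page
                    response.raise_for_status()
                    newItems = response.json()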
                    fetchCount = len(newItems)
                    pbar.update(fetchCount)
                    if fetchCount != API_PAGE_SIZE:
                        logging.info(f"WARN: Requested {API_PAGE_SIZE}, got {fetchCount}")
                    offset += API_PAGE_SIZE
                    items.extend(newItems)
                    if fetchCount < API_PAGE_SIZE:
                        logging.info("Received partial page, ending")
                        break
                    time.sleep(1)
        
            logging.info(f"Total fetched: {len(items)}")
            with open(outFile, "w", encoding="utf-8") as file:
                logging.info("Writing to json...")
                json.dump(items, file, ensure_ascii=False, indent=2)
        
            # Print some stats
            logging.info(f"\nChecking json output: {args.api_route}.json")
        
            entries = []
            with open(outFile, "r", encoding="utf-8") as file:
                entries = json.load(file)
        
            logging.info(f"{len(entries)} entries in json")
        
            entryDict = {}
            for entry in entries:
                entryDict.update({entry['id']: entry})
        
            logging.info(f"{len(entryDict)} unique IDs in json")
        else:
            logging.error("Client ID or Access Token not available")
            exit(1)
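
The script above keeps every page in memory and writes once at the end, so a failure partway through loses everything. One way to add checkpointing is to append each page to a JSON Lines file and resume from the number of lines already saved. This is only a sketch under assumptions: `fetch_page(offset, limit)` is a hypothetical callable that would wrap the `requests.post` call, and the output is one JSON object per line rather than the single JSON array that dump.py produces:

```python
import json
import os


def count_saved(path):
    """Number of entries already written to the JSON Lines file."""
    if not os.path.exists(path):
        return 0
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def dump_with_resume(fetch_page, path, page_size=500):
    """Fetch pages until a short page arrives, appending each page to disk.

    fetch_page(offset, limit) is a hypothetical callable returning a list of
    dicts. Returns the total number of entries stored in the file.
    """
    offset = count_saved(path)  # resume where the previous run stopped
    while True:
        page = fetch_page(offset, page_size)
        with open(path, "a", encoding="utf-8") as f:
            for item in page:
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
        offset += len(page)
        if len(page) < page_size:
            return offset
```

Rerunning after a crash picks up at the saved offset. It does not protect against a line that was cut off mid-write, and it assumes the ordering of results stays the same between runs.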