Retrieving Data

This section covers methods for retrieving datasets, as well as the structure of saved data.

Methods

There are several ways to retrieve experiment data from the database.

Retrieving using download_datafiles

The simplest way to retrieve data is using the download_datafiles command. This creates three csv files containing the three kinds of data: trial data, question data, and event data.

Retrieving programmatically

While the download_datafiles shell command is the simplest way to retrieve experiment data, a more powerful and flexible solution is to retrieve the data programmatically. Many languages offer libraries for interfacing with mysql and sqlite databases - below is an example using python and the sqlalchemy package to retrieve data from a mysql database. We add +pymysql to the db_url to let sqlalchemy make use of pymysql package. (You can leave the database_url in config.txt as mysql:// though – psiturk adds +pymysql internally). By including code such as this at the beginning of your analysis script, you can be sure the the data you’re analyzing is always complete and up-to-date.

from sqlalchemy import create_engine, MetaData, Table
import json
import pandas as pd

db_url = "mysql+pymysql://username:password@host.org/database_name"
table_name = 'my_experiment_table'
data_column_name = 'datastring'
# boilerplace sqlalchemy setup
engine = create_engine(db_url)
metadata = MetaData()
metadata.bind = engine
table = Table(table_name, metadata, autoload=True)
# make a query and loop through
s = table.select()
rows = s.execute()

data = []
#status codes of subjects who completed experiment
statuses = [3,4,5,7]
# if you have workers you wish to exclude, add them here
exclude = []
for row in rows:
    # only use subjects who completed experiment and aren't excluded
    if row['status'] in statuses and row['uniqueid'] not in exclude:
        data.append(row[data_column_name])

# Now we have all participant datastrings in a list.
# Let's make it a bit easier to work with:

# parse each participant's datastring as json object
# and take the 'data' sub-object
data = [json.loads(part)['data'] for part in data]

# insert uniqueid field into trialdata in case it wasn't added
# in experiment:
for part in data:
    for record in part:
        record['trialdata']['uniqueid'] = record['uniqueid']

# flatten nested list so we just have a list of the trialdata recorded
# each time psiturk.recordTrialData(trialdata) was called.
data = [record['trialdata'] for part in data for record in part]

# Put all subjects' trial data into a dataframe object from the
# 'pandas' python library: one option among many for analysis
data_frame = pd.DataFrame(data)

How the datastring is structured

The main data from an experiment participant is held in a string of text in the datastring field of the data table. Understanding how this string is structured is important to be able to parse the string into a useful format for your analyses.

The datastring is structured as a json object. In the description that follows, sub-objects are indicated by names wrapped in angle brackets (< >).

Top Level

The top level of the datastring contains summary information about the worker, as well as the datastring sub-objects:

{"condition": condition,
"counterbalance": counterbalance,
"assignmentId": assignmentId,
"workerId": workerId,
"hitId": hitId,
"currenttrial": trial_number_when_data_was_saved,
"useragent": useragent,
"data": <data>,
"questiondata": <questiondata>,
"eventdata": <eventdata>,
"mode": <mode>}

data

The data sub-object contains a list of the data recorded each time psiturk.recordTrialData() is called in the experiment:

[
  {
    "uniqueid": uniqueid,
    "current_trial": current_trial_based_on_num_of_calls_to_recordTrialData,
    "dataTime": current_time_in_system_time,
    "trialdata": //<datalist>
  }
  //,
  // ...
]

Here, <datalist> is whatever is passed to psiturk.recordTrialData() in the experiment. This could be in any format, such as a string or list, but we recommend saving data in a json format so that all data is stored in a clear, easy-to-parse “field-value” format. <dataTime> is recorded in UTC time.

questiondata

The questiondata sub-object contains all items recorded using psiturk.recordUnstructuredlData().

{"field1": value1,
 "field2": value2,
 ...
}

eventdata

The eventdata sub-object contains a list of events (such as window resizing) that occurred during the experiments:

[{"eventtype": eventtype,
  "value": value,
  "timestamp": current_time_in_system_time,
  "interval": interval},
  ...
 ]