Natural language processing-based part of speech tags characteristic of childhood onset psychosis

Anthony J. Deo 1,2,3, Cynthia Lando 4, Andrew Carolan 1, Johanne Solis 5, Yuli Fradkin 1, Thanharat Silamongkol 4, Chloe Rosenkranz 4, Benjamin Herrera 6, Walter Barr 3, Emma Deaso 7, David Glahn 2,7, Michele Pato 1, David Zald 1, Carlos Pato 1

1 Department of Psychiatry, Rutgers-Robert Wood Johnson Medical School, Piscataway, NJ

2 Department of Psychiatry, Harvard Medical School, Boston, MA

3 Psychiatric Evaluation of Adolescent and Child Experiences (PEACE) Program, Rutgers University Behavioral Health Care, Piscataway, NJ

4 Graduate School of Applied and Professional Psychology, Rutgers University, Piscataway, NJ

5 The Rutgers-Princeton Center for Computational Cognitive Neuropsychiatry, Piscataway, NJ

6 Rutgers New Jersey Medical School, Newark, NJ

7 Department of Psychiatry and Behavioral Sciences, Boston Children’s Hospital, Harvard Medical School, Boston, MA

Background

Childhood onset schizophrenia is associated with both language delays and persistent subtle changes in language structure. Recent advances in natural language processing (NLP) have revealed consistent, quantifiable changes in language structure that are predictive of diagnosis in clinical high risk for psychosis (CHR), schizophrenia, and bipolar spectrum disorders in older adolescents and adults. NLP-based measures have not been examined in children and young adolescents with a broad psychosis phenotype.

Methods

All protocols were approved by the Institutional Review Board at Rutgers University. Cases were defined as children and adolescents with a diagnosis of psychosis of any cause, including affective (36%), nonaffective (21%), and unspecified psychosis (43%) (N=14); controls were defined as having no personal history of psychosis (N=9). The mean age at the time of evaluation was 12.29 (+/- 2.05) years for cases and 12.89 (+/- 2.57) years for controls. Gender assigned at birth was 50% female for cases and 44% female for controls. Clinical assessment for psychosis included the Kiddie Schedule for Affective Disorders and Schizophrenia (KSADS) and the Structured Interview for Psychosis-Risk Syndromes (SIPS). The Story Game was used to collect samples specifically for language analysis, as it has been validated to elicit expression of thought disorder in children. Interviews were recorded via remote video conferencing and manually transcribed verbatim by a medical transcription service. Part of speech tags were assigned using the Natural Language Toolkit (NLTK) in Python. Analysis included all available transcribed language samples for a participant, including the KSADS, SIPS, and Story Game. A binary logistic regression model was fitted with case status (presence or absence of psychosis) as the dependent variable, using an exploratory forward selection stepwise procedure to determine the predictors in the model. All 35 parts of speech categorized by NLTK were included in the model selection analysis. Age and gender assigned at birth were also included in the model but were found to have no impact on model performance.
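As a rough illustration of this modeling step (a minimal sketch, not the exact procedure used), the code below runs an AIC-based forward stepwise selection for a binary logistic regression over per-participant POS tag counts. It assumes a feature table like the final_analysis.csv produced by the tagging script later in this document, with a hypothetical case_status column (1 = psychosis, 0 = control) added; the column name, the AIC criterion, and the use of statsmodels are all assumptions.

    # Sketch only: forward stepwise logistic regression over POS tag counts.
    # 'final_analysis.csv' is produced by the tagging script below; the
    # 'case_status' outcome column (1 = psychosis, 0 = control) is hypothetical.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv('final_analysis.csv')
    y = df['case_status']
    candidates = [c for c in df.columns if c not in ('File', 'case_status')]

    selected = []
    while candidates:
        # AIC of the current model (treated as infinite before any predictor is added)
        current_aic = (sm.Logit(y, sm.add_constant(df[selected])).fit(disp=0).aic
                       if selected else float('inf'))
        # Try adding each remaining tag and keep the one that lowers AIC the most
        aic_by_tag = {tag: sm.Logit(y, sm.add_constant(df[selected + [tag]])).fit(disp=0).aic
                      for tag in candidates}
        best_tag = min(aic_by_tag, key=aic_by_tag.get)
        if aic_by_tag[best_tag] >= current_aic:
            break  # no remaining tag improves the fit
        selected.append(best_tag)
        candidates.remove(best_tag)

    # Fit the final model and report in-sample classification accuracy
    final_model = sm.Logit(y, sm.add_constant(df[selected])).fit(disp=0)
    accuracy = ((final_model.predict() >= 0.5) == y).mean()
    print(selected, f"in-sample accuracy: {accuracy:.1%}")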

Results

The mean length of recordings was 149 (+/- 78) minutes per participant. The extended length of the transcripts is important because prior NLP analyses of transcribed speech in CHR adolescents were limited by the small amount of speech produced in a relatively short interview using only the Story Game. A 3-predictor binary logistic regression model achieved 78.3% classification accuracy in distinguishing cases of childhood onset psychosis from controls. The 3-predictor model included the NLTK tags MD (modal verbs), RBS (superlative adverbs, e.g., best), and WRB (wh-adverbs, e.g., where, when).

Conclusion

We present the first NLP-based analysis of extensive language samples derived from a broad psychosis phenotype in children and adolescents. A binary logistic regression model based on only 3 part of speech tag predictors was able to classify childhood onset psychosis cases versus controls with reasonable accuracy. The predictors in the model, including modal verbs, which convey ambiguity about what may be possible (e.g., can, could, may, might), and adverbs (including where, when, and superlatives), have previously been noted to occur at altered frequencies in schizophrenia.
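For readers unfamiliar with the Penn Treebank tags reported above, the short example below shows how NLTK assigns them; the sentence is invented for illustration and is not drawn from study transcripts.

    import nltk
    from nltk.tokenize import word_tokenize
    # One-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    # Invented sentence containing words that typically receive the predictor tags:
    # modal verbs (MD: could, might), a superlative adverb (RBS: best),
    # and wh-adverbs (WRB: where, when).
    sentence = "We might stay where it could work best when it rains."
    print(nltk.pos_tag(word_tokenize(sentence)))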


    # Imports required by the analysis script (standard library, pandas, and NLTK).
    # NLTK's 'punkt' and 'averaged_perceptron_tagger' data must be downloaded once.
    import os
    import string

    import pandas as pd
    import nltk
    from nltk.tokenize import word_tokenize

    # Look up a participant's speaker label(s) in the speaker-mapping CSV
    def get_column_values(lookup_string, column_name):
        # Read the CSV file into a DataFrame
        speaker = pd.read_csv('****')
        
        # Find the row(s) that contains the lookup string in the 'File name' column
        matching_rows = speaker.loc[speaker['File name'] == lookup_string]
        
        # Extract the values from the specified column in the matching rows
        column_values = matching_rows[column_name].values
        
        return column_values
    
    # Function to replace punctuation with spaces
    def replace_punctuation_with_space(text):
        translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
        return text.translate(translator)
    def parse_file_path(file_path):
        # Extract the file name using os.path.basename
        file_name = os.path.basename(file_path)
        return file_name
    # Function to perform POS tagging and count tags
    def count_pos_tags(file_path):
        # Read CSV file into DataFrame
        df = pd.read_csv(file_path)
        file_name = parse_file_path(file_path)
    
        # Create a new column 'Speaker' which identifies if a row contains speaker information
        df['Speaker'] = df['Transcription details:'].apply(lambda x: x if x.startswith('S') else None)
    
        # Forward-fill the speaker information to align with the corresponding dialogue rows
        df['Speaker'] = df['Speaker'].ffill()
    
        # Drop the rows that held only the speaker label (their values are now forward-filled)
        df = df[df['Transcription details:'].apply(lambda x: not x.startswith('S'))]
    
        df = df[df['Speaker'].notna()]
            
        # Look up the participant's speaker label for this file
        file_name_without_extension = file_name.replace(".csv", "")
        s = get_column_values(file_name_without_extension, "Participant")
        
        pos_counts = {
        'CC': 0, 'CD': 0, 'DT': 0, 'EX': 0, 'FW': 0, 'IN': 0,
        'JJ': 0, 'JJR': 0, 'JJS': 0, 'LS': 0, 'MD': 0, 'NN': 0,
        'NNS': 0, 'NNP': 0, 'NNPS': 0, 'PDT': 0, 'POS': 0, 'PRP': 0,
        'PRP$': 0, 'RB': 0, 'RBR': 0, 'RBS': 0, 'RP': 0, 'TO': 0,
        'UH': 0, 'VB': 0, 'VBD': 0, 'VBG': 0, 'VBN': 0, 'VBP': 0,
        'VBZ': 0, 'WDT': 0, 'WP': 0, 'WP$': 0, 'WRB': 0
            }
    
        if len(s) > 0 and not (len(s) == 1 and pd.isna(s[0])):
            df = df[df['Speaker'].apply(lambda x: any(x.startswith(prefix) for prefix in s))]
        else:
            # No participant speaker label found for this file; return zero counts
            return parse_file_path(file_path), pos_counts
        
    
        # Concatenate all the text from 'Transcription details:' column
        result = ' '.join(df['Transcription details:'])  # join with spaces so words at row boundaries do not merge
    
        # Preprocess text: replace punctuation with spaces
        result = replace_punctuation_with_space(result)
    
        # Perform POS tagging
        tokens = word_tokenize(result)
        tagged = nltk.pos_tag(tokens)
    
        # Count POS tags
        for word, tag in tagged:
            if tag in pos_counts:
                pos_counts[tag] += 1
    
    
        return parse_file_path(file_path), pos_counts
    
    # Function to list files in a folder
    def list_files_in_folder(folder_path):
        files_list = []
        for file_name in os.listdir(folder_path):
            file_path = os.path.join(folder_path, file_name)
            if os.path.isfile(file_path):
                files_list.append(file_path)
        return files_list
    
    # Example usage:
    folder_path = '******'  # Replace with your folder path
    files = list_files_in_folder(folder_path)
    
    # Initialize an empty list to store the results
    results = []
    
    # Process each file in the folder
    for file in files:
        file_name, pos_counts = count_pos_tags(file)
        pos_counts['File'] = file_name  # Add the file name to the dictionary
        results.append(pos_counts)
    
    # Create a DataFrame from the results list
    df = pd.DataFrame(results)
    
    # Move the 'File' column to the front
    df = df[['File'] + [col for col in df.columns if col != 'File']]
    
    df.to_csv('final_analysis.csv', index=False)
    

Summary

This project uses natural language processing (NLP) to analyze part-of-speech (POS) patterns in transcripts from children with childhood onset psychosis. I developed a program to count specific POS tags in each transcript, helping to identify linguistic markers that may predict psychosis. The goal is to enable earlier diagnosis, which could have a significant impact on improving children's futures. I’m thrilled by the potential this research holds for advancing mental health care for young people.

Heartbeat Detector Arduino


    #include <Wire.h>
    #include "MAX30105.h"
    #include "heartRate.h"
    #include <Adafruit_GFX.h>
    #include <Adafruit_SSD1306.h>
                
    MAX30105 particleSensor;
                
    // Define the OLED display size
    #define SCREEN_WIDTH 128
    #define SCREEN_HEIGHT 64
    Adafruit_SSD1306 display(SCREEN_WIDTH, SCREEN_HEIGHT, &Wire, -1);
                
    const byte RATE_SIZE = 4; //Increase this for more averaging. 4 is good.
    byte rates[RATE_SIZE]; //Array of heart rates
    byte rateSpot = 0;
    long lastBeat = 0; //Time at which the last beat occurred
                
    float beatsPerMinute;
    int beatAvg;
                
    void setup() {
        Serial.begin(115200);
        Serial.println("Initializing...");
                
        // Initialize the screen
        if (!display.begin(SSD1306_SWITCHCAPVCC, 0x3C)) { // Address 0x3C for most SSD1306 displays
            Serial.println("SSD1306 allocation failed");
            while (1);
        }
        display.display();
        delay(2000);
        display.clearDisplay();
                  
        // Initialize sensor
        if (!particleSensor.begin(Wire, I2C_SPEED_FAST)) {
            Serial.println("MAX30102 was not found. Please check wiring/power. ");
            while (1);
        }
        Serial.println("Place your index finger on the sensor with steady pressure.");
                
        particleSensor.setup(); //Configure sensor with default settings
        particleSensor.setPulseAmplitudeRed(0x0A); //Turn Red LED to low to indicate sensor is running
        particleSensor.setPulseAmplitudeGreen(0); //Turn off Green LED
    }
                
    void loop() {
        long irValue = particleSensor.getIR();

        if (checkForBeat(irValue) == true) {
            //We sensed a beat!
            long delta = millis() - lastBeat;
            lastBeat = millis();

            beatsPerMinute = 60 / (delta / 1000.0);

            if (beatsPerMinute < 255 && beatsPerMinute > 20) {
                rates[rateSpot++] = (byte)beatsPerMinute; //Store this reading in the array
                rateSpot %= RATE_SIZE; //Wrap variable

                //Take average of readings
                beatAvg = 0;
                for (byte x = 0 ; x < RATE_SIZE ; x++)
                    beatAvg += rates[x];
                beatAvg /= RATE_SIZE;
            }
        }

        // Display the results
        display.clearDisplay();
        display.setTextSize(1);
        display.setTextColor(SSD1306_WHITE);
        display.setCursor(0,0);

        display.print("IR=");
        display.println(irValue);
        display.print("BPM=");
        display.println(beatsPerMinute);
        display.print("Avg BPM=");
        display.println(beatAvg);

        if (irValue < 50000)
            display.println("No finger?");

        display.display();

        // Serial output for debugging
        Serial.print("IR=");
        Serial.print(irValue);
        Serial.print(", BPM=");
        Serial.print(beatsPerMinute);
        Serial.print(", Avg BPM=");
        Serial.print(beatAvg);

        if (irValue < 50000)
            Serial.print(" No finger?");

        Serial.println();
    }

Summary

I used a heartbeat detector, a screen, and an Arduino, learning how to solder components and work with basic electrical circuits. This project gave me hands-on experience in both hardware assembly and coding. Seeing the device successfully display heart rate data was a rewarding moment, as I brought a functional system into the real world. These small but impactful projects fuel my passion for electronics and give me a great sense of pride in my work.

Palantir Technologies Financial Model

Summary

I decided to create a simple financial model for Palantir after hearing the company had joined the S&P 500. I looked back over the past year, analyzing quarterly data and projecting 10% annual revenue growth. I calculated the net present value (NPV), divided it by the share count, and found the stock appeared slightly overvalued. However, I noted that the S&P 500 inclusion could drive increased trading volume. Through this project, I also learned about the high gross profit margins typical in tech, which made the analysis interesting.
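As a rough sketch of the valuation arithmetic described above, the snippet below discounts projected cash flows growing at 10% per year and divides by the share count; every number here is a placeholder assumption for illustration, not a figure from the actual model.

    # Placeholder sketch of the NPV-per-share calculation; all inputs are
    # assumed values for illustration, not the figures used in the model.
    def npv(cash_flows, discount_rate):
        # Discount each future annual cash flow back to present value
        return sum(cf / (1 + discount_rate) ** year
                   for year, cf in enumerate(cash_flows, start=1))

    base_cash_flow = 1.0e9      # assumed starting annual free cash flow ($)
    growth_rate = 0.10          # 10% annual growth, as in the write-up
    discount_rate = 0.08        # assumed discount rate
    years = 10                  # assumed projection horizon
    shares_outstanding = 2.3e9  # placeholder share count

    cash_flows = [base_cash_flow * (1 + growth_rate) ** t for t in range(1, years + 1)]
    value_per_share = npv(cash_flows, discount_rate) / shares_outstanding
    print(f"Estimated value per share: ${value_per_share:.2f}")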