R scripts for removing duplicates and formatting XML

I have been working on this for the past couple of weeks and wanted to post the scripts. At PLoS (where the corpora for this project come from), articles can be tagged in multiple subject areas. Hence I found 67 overlapping articles in the Math and Ecology corpora I built, and I wanted to write a function to remove them. Stacy recommended also saving one copy of each for the final analysis, to see where these co-tagged articles land in the spatial visualization of topics (i.e. in the center, dispersed throughout, etc.). These functions don't account for saving a copy, but I plan to add that soon; a rough sketch of what that might look like is below, before the main scripts. I apologize for how difficult this is to read (below) and am happy to share files.
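As a placeholder, here is a minimal sketch of saving one copy of each co-tagged article, assuming the dupesList.txt file written by dupeFilenames() further down; the 'coTagged' folder name and the choice to copy the ecology version are hypothetical, not part of the scripts below.

## Hedged sketch only: keep one copy of each co-tagged article for the later
## spatial-visualization step. 'coTagged' is a hypothetical folder name.
saveCoTagged <- function(dir='.', subjectFolder='ecology', to_dir='coTagged') {
  dupes <- scan(file.path(dir, 'dupesList.txt'), what='char', sep='\n')
  dir.create(file.path(dir, to_dir), showWarnings=FALSE)
  for (f in unique(dupes)) {
    file.copy(file.path(dir, subjectFolder, f), file.path(dir, to_dir, f))
  }
}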

# Sarah Clark, 4 May 2012
# Function to remove duplicate files in corpus
## Function to write vector with all file names of duplicate files
## ex. exAllDupes <- dupeFilenames(allNames, 'C:/Users/Sarah/Desktop/pone')
## TODO: check if dir exists, if not create

dupeFilenames <- function(fileVec, printDir) {
  repVec <- rep(NA, length(fileVec))
  for(i in 1:length(fileVec)) {
    # fixed=TRUE so the dots in filenames are matched literally, not as regex wildcards
    grepCount <- grep(fileVec[i], fileVec, fixed=TRUE)
    if(length(grepCount) >= 2) {
      repVec[i] <- as.character(fileVec[i])
    }
  }
  repVec <- repVec[!is.na(repVec)]  # drop the NA placeholders left for non-duplicates
  repVec <- unique(repVec)
  repVec <- sort(repVec)
  cat(repVec, file=file.path(printDir, 'dupesList.txt'), sep='\n')
  return(repVec)
}
## Function to create new filenames vector without dupes (important for scanning files later)
## print to file in ascending URL order, appending appropriate future folder reference to front for scanning
## This function was unnecessary; could have looped over all files directly rather than building a vector of names
## ex. exMath <- dupeFreeFilenames(mathNames, 'Mathematics')
## TODO: check if dirs exist, if not create
## subject area is full name of subject, e.g. 'ecology' not 'ecol'
dupeFreeFilenames <- function(filenames, subjectArea) {
  repVec <- scan(file='dupesList.txt', what='char', sep='\n')
  for(i in 1:length(repVec)) {
    filenames[filenames == repVec[i]] <- 'xxx'
  }
  filenamesNew <- filenames[which(filenames != 'xxx')]
  filenamesNew <- sort(filenamesNew, decreasing=FALSE)
  # prepend the future folder reference, e.g. 'mathematicsDupesFree.'
  filenamesNew <- gsub('^', paste(tolower(subjectArea), 'DupesFree', '.', sep=''), filenamesNew)
  cat(filenamesNew,
      file=file.path('.', paste(tolower(subjectArea), 'FilenamesVectorDupeFree', '.txt', sep='')),
      sep='\n', append=FALSE)
  return(filenamesNew)
}

 

## Wrapper function to create new folder to copy non-duped text files into, preserving back-up of original files in main folder
## ex. dupeFreeFolder(ecolNames, 'Ecology')
## creates the directory; 'filenames' here is the original vector (e.g. mathNames),
## and dupeFreeFilenames() from the previous function supplies the DupesFree names
## subject area is full name of subject, e.g. 'ecology' not 'ecol'

dupeFreeFolder <- function(filenames, subjectArea, dir='.') {
  filenames <- dupeFreeFilenames(filenames, subjectArea)
  filenamesOrig <- sub('^.*?[.]', '', filenames)  # strip the 'subjectDupesFree.' prefix back off
  to_dir <- paste(tolower(subjectArea), 'DupesFree', sep='')
  dir.create(to_dir, showWarnings=FALSE)
  for(i in 1:length(filenames)) {
    file.copy(file.path(dir, tolower(subjectArea), filenamesOrig[i]),
              file.path(dir, to_dir, filenames[i]))
  }
}

 

## Run functions

setwd('C:/Users/Sarah/Desktop/pone')

## Scan names of downloaded, uncleaned ecol/math articles
## and remove the file directory prefix

## TODO: put this in the function

ecolNames <- scan(file='C:/Users/Sarah/Desktop/pone/FilenamesVectorEcol.txt', what='char', sep='\n')
ecolNames <- gsub('ecology/', '', ecolNames)

mathNames <- scan(file='C:/Users/Sarah/Desktop/pone/FilenamesVectorMath.txt', what='char', sep='\n')
mathNames <- gsub('mathematics/', '', mathNames)

allNames <- c(ecolNames, mathNames)

dupeFilenames(allNames, 'C:/Users/Sarah/Desktop/pone')

# run wrapper function
# note lower or upper case is okay for subject area

dupeFreeFolder(mathNames, 'Mathematics')
dupeFreeFolder(ecolNames, 'ecology')

 

## clean-up for PONE with corrected file names to sort and no dupes
# 12 May 2012

# set working directory to where corpus is stored
setwd('C:/Users/Sarah/Desktop/pone')

## universal cleanup XML w/ subject area
## ex. cleanupXML(filenamesMath, 'mathematics')
## Full implementation in wrapper 'dupeFreeAndClean' below

cleanupXML <- function(filenames, subjectArea) {

  # Create the two output directories once, up front
  dirNameSubject <- paste(tolower(subjectArea), 'DupeFreeClean', sep='')
  dir.create(dirNameSubject, showWarnings=FALSE)
  dirNameAll <- 'allDupeFreeClean'
  dir.create(dirNameAll, showWarnings=FALSE)

  folderPath <- paste(tolower(subjectArea), 'DupesFree', sep='')
  for (filename in filenames) {

    # Scan in the file: reads in the text characters, making a new
    # vector after every newline (\n) character
    xml <- scan(file.path(folderPath, filename), what="char", sep="\n")

    # Join all the broken-up newlines (separate vectors) into a
    # single line/vector for processing, separated by spaces
    h1 <- paste(xml, collapse=" ")

    # Extract the body of the paper
    h2 <- sub("^.*<body>", "", h1)
    h3 <- sub("</body>.*$", "", h2)

    # Erase in-text numerical citations, including brackets and to double digits
    h4 <- gsub("\\[[0-9]\\]|\\[[1-9][0-9]\\]", "", h3)

    # Erase in-text parenthetical lists, alpha and numeric, leaving text
    # i.e. (a) xxx (b) xxx or (1) xxx (2) xxx
    h5 <- gsub("\\([a-z]\\)", " ", h4)
    h6 <- gsub("\\([0-9]\\)|\\([1-9][0-9]\\)", " ", h5)

    # Erase electronic references
    h7 <- gsub("<xref.*?>.*</xref>", " ", h6)

    # Erase figures in XML while keeping their captions
    h8 <- gsub("<fig.*?>.*<caption>", " ", h7)
    h9 <- gsub("</caption>.*?</fig>", " ", h8)

    # Erase the XML tags
    # A ^ inside brackets negates the expression
    h10 <- gsub("<[^>]*>", " ", h9)

    # Erase any punctuation using space so that hyphenated and slashed words do not merge
    # \\ escapes special characters
    h11 <- gsub("[\\!\\\"#\\$%&'\\(\\)\\*\\+,-./:;<=>\\?@\\^_`\\{|\\}~.]", " ", h10)

    # Erase URLs
    h12 <- gsub("www[A-Za-z]*", " ", h11)

    # Keep text only
    h13 <- gsub("[^A-Za-z]", " ", h12)

    # Remove extra whitespace
    h14 <- gsub(" +\\s", " ", h13)

    # Print each cleaned-up article as a separate plain text file into 2 directories
    outfile <- sub("[.]xml$", "-clean.txt", filename)
    cat(h14, file=file.path(dirNameSubject, outfile), sep='\n')
    # Be careful with append if you really want to overwrite the whole file (delete it first)
    cat(h14, file=file.path(dirNameAll, outfile), sep='\n', append=TRUE)
  }
  invisible(NULL)
}

 

## Function to fully implement reading in the dupe-free corpora, cleaning up the XML,
## and printing out to the 'DupeFreeClean' folders as well as the 'all' folder that holds
## all cleaned and de-duped articles

# Process full text XML from DupesFree folders
# filenames are the dupe-free vector filenames

dupeFreeAndClean <- function(subjectArea, dir='.') {
  # create the name of the text file where the dupe-free filenames are stored
  filenameTxt <- paste(tolower(subjectArea), 'FilenamesVectorDupeFree.txt', sep='')
  # scan in the text file storing filenames for the subject area's DupesFree folder
  filenamesVec <- scan(file.path(dir, filenameTxt), what='char', sep='\n')
  # run clean-up function
  cleanupXML(filenamesVec, subjectArea)
}

# Run wrapper
dupeFreeAndClean('ecology')
dupeFreeAndClean('mathematics')

 


Math and Ecology LDA Results

Some results of creating topics within just the Math corpus and just the Ecology corpus. At first glance they seem pretty similar, which is starting to confirm my suspicion that creating two corpora from the same journal may be a problem for differentiating math and ecology.
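One quick, hedged way to put a number on that overlap (a sketch only; ldaMath and ldaEcol are placeholder names for the two fitted topicmodels objects):

library(topicmodels)
# Sketch: ldaMath and ldaEcol are hypothetical fitted LDA models, one per corpus
topMath <- terms(ldaMath, 10)   # top 10 terms for each math topic
topEcol <- terms(ldaEcol, 10)   # top 10 terms for each ecology topic
# how many of the top terms the two corpora share overall
length(intersect(as.vector(topMath), as.vector(topEcol)))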


Silly Mistakes and LDA Results!

First, a more interesting topic: topics! The LDA results for the DataONE corpus, comparing the math and ecology articles:

LDA Results

The LDA ran on my laptop. The silly mistake came to light with Stacy at NCEAS last week, when she noticed I was running the analysis straight off my USB drive. I have been carrying the data around so much that I didn't even think to copy it to the desktop, so I did that in the VisLab at NCEAS. I still didn't make the connection that I should take the data off my USB at home until some unconscious realization hit me last night. I left the LDA running overnight and by morning it was done. Let that be a warning to all…silly mistakes.

Right now the next step is to develop a news corpus for comparison, but before that I would love to run some visualization scripts on these results to see whether the math and ecology articles separate out. That actually does seem like a good first analysis, because if the math and ecology topics overlap I may need to go back and create more (or fewer) topics (there are 25 right now).
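A first pass at that check could stay in R rather than needing a separate visualization script. A minimal sketch, assuming the fitted lda object from the script below and a placeholder flag for which documents are math (mathDocNames is hypothetical):

library(topicmodels)
# Sketch: posterior() returns the per-document topic probabilities of the fitted model
gamma <- posterior(lda)$topics
# isMath is a placeholder: a logical flag marking which documents came from the math corpus,
# e.g. built from the filename vectors used during pre-processing (mathDocNames is hypothetical)
isMath <- rownames(gamma) %in% mathDocNames
# average weight each group puts on each of the 25 topics; large differences suggest separation
barplot(rbind(colMeans(gamma[isMath, ]), colMeans(gamma[!isMath, ])),
        beside=TRUE, legend.text=c("math", "ecology"),
        xlab="topic", ylab="mean topic probability")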

Great! What a wonderful thing to wake up to.

* As a fun side note, compare the length of the script to run the LDA with the previous pre-processing scripts (below). It is so much shorter, which says a lot about where the heavy lifting lies in text analysis.

# Working with just PONE journal articles, comparing ecology and math to each other
# TODO: news corpus

library(tm)
library(topicmodels)
library(lasso2)

corpus <- Corpus(DirSource("C:/Users/Sarah/Desktop/DataONE/pone/All"))
# TODO: incomplete final line issue

# corpus clean-up (tm_map)
# make all lowercase and remove common words
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)
lda <- LDA(dtm, 25, method = "Gibbs")

terms(lda)
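On the 'incomplete final line' TODO: that warning just means the cleaned text files end without a trailing newline. A hedged fix, assuming the same All folder path as above, is to rewrite each file with writeLines, which terminates every line:

# Sketch: add the missing trailing newline to each cleaned file
allFiles <- list.files("C:/Users/Sarah/Desktop/DataONE/pone/All", full.names=TRUE)
for (f in allFiles) {
  txt <- readLines(f, warn=FALSE)   # warn=FALSE silences the incomplete-final-line warning here
  writeLines(txt, f)                # writeLines adds a newline after every line
}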


Stopwords, Sparse Terms & LDA

Back in the saddle. I worked at NCEAS today hoping the VisLab computer could run the LDA without memory issues. In the process I remembered I needed to update the clean-up code for the corpus. This included removing English stopwords using the tm package's built-in dictionary. Stopwords are just very common words, such as 'the' and 'for', that contribute minimal content to a paper. I also looked at removing sparse terms with Stacy, and on her advice (based on her own work) decided to remove only terms that occurred in fewer than 1% of the papers in the corpus. The current corpus I am working on has 2,469 documents, so that corresponds to fewer than 25 articles. This can be adjusted in the future.
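For reference, the 1% rule presumably maps onto tm's removeSparseTerms along these lines (a sketch; dtm is the document-term matrix from the script in the previous post):

library(tm)
# Sketch: sparse=0.99 keeps only terms that appear in more than ~1% of the
# 2,469 documents (roughly 25 papers), dropping everything rarer
dtmDense <- removeSparseTerms(dtm, 0.99)
dim(dtmDense)   # documents x remaining terms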

Happily the document-term matrix (dtm) built smoothly. However the stopword removal did not actually take effect, and I am in the midst of working through what bugs may be going on. One forum discussion was visited by Ingo Feinerer (the developer of the package), who made reference to some bugs and an updated version of the package. I will need to look into that.
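Until that is sorted out, one hedged workaround is to drop the stopword columns straight out of the document-term matrix (a sketch, using the same dtm object):

library(tm)
# Sketch: remove English stopword columns directly from the document-term matrix
keep <- which(!(Terms(dtm) %in% stopwords("english")))
dtmNoStop <- dtm[, keep]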

After removing sparse terms, the number of terms in the dtm went from 97,968 down to 9,219. Inspecting the dtm suggested that a large number of nonsensical terms were what got cut. How efficient! I was extremely pleased that this function ran and worked correctly; it takes a lot of the worry about intensive pre-processing of the documents off my mind.

The most interesting part! The findFreqTerms function worked, which is what revealed that stopwords were not being removed. Looking for words that occurred at least 10,000 times (yes, 10,000) in the corpus returned 21 words. These included 'the', 'for' and 'this', which should already have been removed. However, some meaningful words also came back, such as 'analysis', 'data' and 'population'. I found this thrilling…mostly because it suggests some kind of meaningful clustering of topics will emerge later down the road.
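The call was presumably something along these lines (a sketch; lowfreq is the minimum total count across the corpus):

library(tm)
# Sketch: terms with a total count of at least 10,000 across the whole corpus
findFreqTerms(dtm, lowfreq=10000)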

Currently the LDA is chugging away; who knows whether it will work or finish before I need to catch the bus. If not I'll be back at NCEAS to give it a good 24 hours to crunch the dtm.


Memory bust in the DTM

My Master's thesis is due this quarter, hence the paucity of posts. However, last night I decided to re-run the clean-up script on the DataONE articles, which went quickly. So I created the DTM and left my laptop running all night to do the LDA.

I woke up and it had worked! The LDA ran. Unfortunately, 24 of the 25 topics came back with "the" as their top term. The other was "species" (interesting in its own right). Upon inspecting the DTM I found the following error message:

Error: cannot allocate vector of size 1.8 Gb
In addition: Warning messages:
1: In vector(typeof(x$v), nr * nc) :
Reached total allocation of 1979Mb: see help(memory.size)

I’m not sure what this means but there is clearly some problem with the DTM that caused the LDA to allocate strange topics. I will need to spend more time on that aspect of the project in order to get the LDA to run properly.
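For next time, a small sketch of the checks the error message points at (memory.size and memory.limit are Windows-only calls; dtm here stands for the document-term matrix object):

# Sketch: inspect memory use on Windows before re-running the LDA
memory.size()                        # memory currently in use (Mb)
memory.limit()                       # current allocation ceiling (Mb); 64-bit R raises this
print(object.size(dtm), units="Mb")  # how big the document-term matrix itself is
# trimming the dtm's vocabulary (e.g. dropping very sparse terms) should shrink this considerably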


Revised XML R Clean-up Script

Below is the most recent script to clean up the XML documents. It corrects a number of earlier issues: words merging together where hyphens or slashes were removed, and "greedy" deletion of everything after a certain tag. It also adds the ability to remove embedded figures while keeping their captions.

# WORKING 11.30.11
# A function to loop over a set of filenames, and for each one clean up
# the full text of a PLoS ONE journal article in XML format
# TODO: does not store metadata (e.g. authors, date, etc.); erasing numbers
# may affect Mathematics corpus; subscripts and superscripts become nonsensical, leaving in for now

cleanupXML <- function(filenames, output.dir=".") {

  # if output directory doesn't exist, make it (checkDirectory is a small helper defined elsewhere)
  checkDirectory(output.dir)

  for (filename in filenames) {

    # Scan in the file: reads in the text characters, making a new
    # vector after every newline (\n) character
    xml <- scan(file=filename, what="char", sep="\n")

    # Join all the broken-up newlines (separate vectors) into a
    # single line/vector for processing, separated by spaces
    h1 <- paste(xml, collapse=" ")

    # Extract the body of the paper
    h2 <- sub("^.*<body>", "", h1)
    h3 <- sub("</body>.*$", "", h2)

    # Erase in-text numerical citations, including brackets and to double digits
    h4 <- gsub("\\[[0-9]\\]|\\[[1-9][0-9]\\]", "", h3)

    # Erase in-text parenthetical lists, alpha and numeric, leaving text
    # i.e. (a) xxx (b) xxx or (1) xxx (2) xxx
    h5 <- gsub("\\([a-z]\\)", " ", h4)
    h6 <- gsub("\\([0-9]\\)|\\([1-9][0-9]\\)", " ", h5)

    # Erase electronic references
    h7 <- gsub("<xref.*?>.*</xref>", " ", h6)

    # Erase figures in XML while keeping their captions
    h8 <- gsub("<fig.*?>.*<caption>", " ", h7)
    h9 <- gsub("</caption>.*?</fig>", " ", h8)

    # Erase the XML tags
    # A ^ inside brackets negates the expression
    h10 <- gsub("<[^>]*>", " ", h9)

    # Erase any punctuation using space so that hyphenated and slashed words do not merge
    # \\ escapes special characters
    h11 <- gsub("[\\!\\\"#\\$%&'\\(\\)\\*\\+,-./:;<=>\\?@\\^_`\\{|\\}~.]", " ", h10)

    # Erase URLs
    h12 <- gsub("www[A-Za-z]*", " ", h11)

    # Keep text only
    h13 <- gsub("[^A-Za-z]", " ", h12)

    # Remove extra whitespace
    h14 <- gsub(" +\\s", " ", h13)

    # Print each cleaned-up article as a separate plain text file
    outfile <- sub("[.]xml$", "-clean.txt", basename(filename))
    cat(h14, file=file.path(output.dir, outfile))

  }
  invisible(NULL)
}
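For context, a hedged usage sketch (the folder names here are placeholders, not necessarily the real ones):

# Sketch: run the clean-up over every XML file in a hypothetical 'ecology' folder
ecolFiles <- list.files('ecology', pattern='[.]xml$', full.names=TRUE)
cleanupXML(ecolFiles, output.dir='ecologyClean')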


‘topicmodels’ freezing

The package 'topicmodels' appeared to be freezing when I tried to run the LDA, even after installing the latest version of R. However, Jim looked at it and explained that it was actually calling some C code internally, which R cannot track while it runs (hence the 'Not Responding' message). Regardless, it is taking a while to run, and Jim has it going in the background on his machine for the day to see what happens.

I just came across a potentially related post about an LDA analysis where the document-term matrix was too large. I am not sure if that is what is going on here, but I might try to dig for solutions in that thread.

In the meantime I am fixing the code to clean up the XML files. I removed all of the non-alpha characters, but that merged a bunch of hyphenated words and words connected by slashes, so I am working backwards from there. I am also working on getting rid of URL references and the nonsensical strings of characters caused by figures in the text.
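The fix I am converging on is simple in spirit (a sketch of the idea only, on toy text; the full revised script is in the post above): replace punctuation with spaces instead of deleting it outright, and strip URL-like tokens before dropping non-letters.

# Sketch of the idea: split hyphenated/slashed words instead of merging them, then drop URLs
txt <- "cost-benefit and presence/absence data, see www.example.org"  # toy example text
txt <- gsub("[-/]", " ", txt)           # hyphen and slash become spaces, so the words stay separate
txt <- gsub("www[A-Za-z.]*", " ", txt)  # crude URL removal (pattern is illustrative only)
txt <- gsub("[^A-Za-z]", " ", txt)      # keep letters only
txt <- gsub(" +", " ", txt)             # collapse repeated spaces
txt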
