Get matching keywords from full-text articles

* Purpose

Search the body of each article for specific keywords and output every keyword found, together with the article's PubMed ID and PubMed Central ID.

* How

Download the full-text XML files of the articles from PubMed Central and use Python's xml.etree.ElementTree library for text mining.

* Tool

Python 2.7.3
(NOTE: xml.etree.ElementTree is part of the standard library from Python 2.5 onward; Python 2.4 does not have it.)

* Files

- all_pmcid.txt: get
- words.list: list of keywords, one per line

1. Get the full-text articles as XML text files from PubMed Central.

See the previous post.
http://bioinfomemo.blogspot.jp/2013/10/get-full-text-article-from-pubmed.html
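For orientation only, a download of this kind could be sketched in Python as follows. This is not the procedure from the previous post: it assumes the NCBI E-utilities efetch endpoint with db=pmc, and the helper names efetch_url and download are made up for this example.

```python
# Sketch only: build an E-utilities efetch URL for one PMC article.
try:
    from urllib.request import urlopen      # Python 3
    from urllib.parse import urlencode
except ImportError:
    from urllib2 import urlopen              # Python 2.7
    from urllib import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmcid):
    """Return the efetch URL for a PMC ID such as 'PMC1234567' (hypothetical helper)."""
    # efetch expects the numeric part of the ID
    number = pmcid[3:] if pmcid.upper().startswith("PMC") else pmcid
    query = urlencode([("db", "pmc"), ("id", number), ("retmode", "xml")])
    return EFETCH + "?" + query


def download(pmcid, path):
    """Save the full-text XML of one article to path (needs network access)."""
    data = urlopen(efetch_url(pmcid)).read()
    with open(path, "wb") as out:
        out.write(data)
```

With the IDs in all_pmcid.txt, download("PMC1234567", "xml/PMC1234567.xml") would be called once per ID (the ID here is a placeholder).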


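2. Remove all text decoration tags

With xml.etree, the text of a paragraph (its .text attribute) ends at the first tag nested inside the paragraph, such as a text decoration tag, so the rest of the paragraph would be lost. Remove these tags before parsing.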
--------------------Shell script---------------------
#!/bin/bash
# Remove the listed tags (opening tags with or without attributes, and
# closing tags) but keep the text between them.
# Requires a sed that supports -E (extended regular expressions).

tags='ext-link|xref|bold|italic|sup|p|supplementary-material|title|caption|media|sec|table-wrap|table-wrap-foot|table|label|thead|tbody|tr|th|td|fn|fig'

# The output keeps the same relative path, e.g. del_tag/xml/PMC*.xml
mkdir -p del_tag/xml

for file in xml/*
do
  sed -E "s#</?(${tags})( [^>]*)?>##g" "$file" > "del_tag/$file"
done
---------------------------------------------------
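Why the tags get in the way can be seen in a small Python example. The sample paragraph is made up; the point is only that .text holds the text before the first child element, while itertext() (available from Python 2.7) walks the whole element.

```python
from xml.etree import ElementTree

# A paragraph with an inline <italic> tag, as found in PMC full text
p = ElementTree.fromstring(
    "<p>Mutations in <italic>BRCA1</italic> were found.</p>")

# .text holds only the text before the first child element
before_first_tag = p.text

# itertext() yields every piece of text inside the element, in order
whole_paragraph = "".join(p.itertext())

print(repr(before_first_tag))
print(repr(whole_paragraph))
```

So reading .text on an untouched paragraph silently drops everything from the first inline tag onward, which is what the sed step works around.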
NOTE: Check that the files are still well-formed XML after the tags are removed.
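One possible way to do this check is to try parsing every file with ElementTree and collect the ones that fail. The sketch below assumes the stripped files are in del_tag/xml/; the function name find_broken_xml is made up for this example. It only tests well-formedness, not validity against the DTD.

```python
import glob
from xml.etree import ElementTree


def find_broken_xml(pattern):
    """Return (path, error message) for each matching file that fails to parse."""
    broken = []
    for path in sorted(glob.glob(pattern)):
        try:
            ElementTree.parse(path)
        except ElementTree.ParseError as err:
            broken.append((path, str(err)))
    return broken


if __name__ == "__main__":
    for path, message in find_broken_xml("del_tag/xml/*.xml"):
        print(path + ": " + message)
```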


3. Merge the XML files into one file and reformat it for the ElementTree library.

(1) Delete the <!DOCTYPE> line from every XML file and add a newline at the end of each. Merge all XML files into one file, wrap the contents in <articleset></articleset>, and add the <!DOCTYPE> line back as the first line.

--------------------Shell script---------------------
#!/bin/bash

cd del_tag/xml/
mkdir -p ins_end

# Keep the first line (the <!DOCTYPE> line) of one file as the header
cat PMC*.xml | head -1 > header.txt

for file in PMC*.xml
do
  # Delete the first line and add a newline at the end of the file
  sed -e '1d' -e '$s/$/\n/' "$file" > "ins_end/$file"
done

# Merge all files, wrap them in <articleset>, and put the header first
{
  cat header.txt
  echo '<articleset>'
  cat ins_end/*
  echo '</articleset>'
} > full.xml
---------------------------------------------------
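The same merge can also be sketched in Python, which avoids the sed differences between systems. This is an illustration, not the script used above: merge_articles is a made-up name, and it assumes, like the shell script, that the first line of every file is the header line and the rest is a single <article> element.

```python
import glob
import io


def merge_articles(pattern, out_path):
    """Merge the files matching pattern into one <articleset> document.

    Returns the number of files merged.
    """
    paths = sorted(glob.glob(pattern))
    with io.open(out_path, "w", encoding="utf-8") as out:
        for n, path in enumerate(paths):
            with io.open(path, encoding="utf-8") as f:
                header = f.readline()   # first line, e.g. the <!DOCTYPE> line
                rest = f.read()         # everything after the first line
            if n == 0:
                # keep the header of the first file, then open the wrapper
                out.write(header.rstrip(u"\n") + u"\n<articleset>\n")
            # make sure each article ends with exactly one newline
            out.write(rest.rstrip(u"\n") + u"\n")
        out.write(u"</articleset>\n")
    return len(paths)
```

Called as merge_articles("del_tag/xml/PMC*.xml", "full.xml"), it would produce a file that ElementTree can parse in one go.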

4. Text mining with Python's xml.etree.ElementTree.

----------------------Python-----------------------

import io
from xml.etree import ElementTree

### LOAD KEYWORDS (ONE PER LINE)
with io.open('words.list', encoding='utf-8') as f:
    words_list = set(f.read().splitlines())

### EXTRACT PMCID, PMID AND BODY TEXT FROM XML
XMLFILE = "full.xml"
tree = ElementTree.parse(XMLFILE)
root = tree.getroot()
art = []
for e in root.iter("article"):
    body = e.find('.//body')
    if body is None:
        continue  # this record has no full text
    # Step 2 removed the <p> tags, so take all text under <body>
    art.append({
        "pmcid": e.findtext('.//article-id[@pub-id-type="pmcid"]'),
        "pmid": e.findtext('.//article-id[@pub-id-type="pmid"]'),
        "text": u" ".join(body.itertext())
    })

### EXTRACT KEYWORDS FROM BODY TEXT
# each result holds PMID, PMCID and keyword
result = []
for a in art:
    for word in a['text'].split():
        if word in words_list:
            result.append({
                "pmid": a['pmid'],
                "pmcid": a['pmcid'],
                "word": word
            })

### UNIQUE RESULT
seen = set()
uniq_result = []
for d in result:
    # sort the items so that equal dicts always give the same tuple
    t = tuple(sorted(d.items()))
    if t not in seen:
        seen.add(t)
        uniq_result.append(d)
---------------------------------------------------
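The script above stops at the uniq_result list. One way to turn such a list into the PMID / PMCID / Keyword output is a tab-separated file written with the csv module. The sketch below is an assumption about the output format, and write_tsv is a made-up name. It is written for Python 3; on Python 2.7 the csv module would need non-ASCII keywords encoded to UTF-8 first.

```python
import csv


def write_tsv(rows, out):
    """Write unique (PMID, PMCID, keyword) rows to the open text file out.

    rows is a list of dicts with the keys "pmid", "pmcid" and "word".
    Returns the number of unique rows written.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["PMID", "PMCID", "Keyword"])
    seen = set()
    for row in rows:
        # a missing ID (None) is written as an empty field
        key = (row.get("pmid") or "", row.get("pmcid") or "", row["word"])
        if key not in seen:
            seen.add(key)
            writer.writerow(key)
    return len(seen)
```

For example, write_tsv(uniq_result, sys.stdout) after an import sys would print the table to the terminal.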